CLARK v1 is a machine-learning classification software created by NC TraCS and CoVar Applied Technologies to enable computable phenotyping in unstructured data. CLARK’s user-friendly interface makes natural language processing (NLP) an accessible option for searching free-text clinical notes. This page includes user instructions and technical documentation for CLARK v1.
For instructions on CLARK v2, go here.
For a conceptual guide to CLARK including research applications and interpretation of results, go here.
Both CLARK v1 and v2 are free and available for download here.
Getting Started
System Requirements
Installation
CLARK: Basic Steps
Navigation
Loading and Saving Progress
Key Concepts
Clinical Notes
Formatting
Metadata
Free Text
Regular Expressions
Basic Regular Expressions
Section Break
Clinical Examples
Training Corpus
Loading the Training Corpus
Troubleshooting
Features
Algorithm Setup
Training Corpus
Regular Expressions Library
Active Regular Expressions
Sectioning
Notes
Patients and Notes
Note with Additional Markup
Using the Notes Viewer
Algorithm
Algorithm Steps-Training Corpus
Algorithm Steps-Evaluation Corpus
Machine Learning Classifiers
Cross-Validation
Explore
Distribution by Labels
Confidence
Filtered Records
Evaluation Corpus Results
Exporting Results
Sensitivity and Specificity
Technical Appendix
Cross-validation
Algorithms in detail
General Troubleshooting
CLARK runs best on Windows machines with 16 GB of RAM and does not require special infrastructure to operate. Processing may take longer with 8 GB of RAM.
Using the Animal Dataset
CLARK comes with a few example files to practice with: AnimalCorpus_V2.json, AnimalCorpus_asMRN.json, AnimalKeywords.json, and AnimalExpressions.json. The AnimalCorpus_V2.json and AnimalCorpus_asMRN.json files mimic labeled clinical notes, where each animal represents a patient. In AnimalCorpus_asMRN.json, IDs are represented with medical record numbers (MRNs) instead of animal names, and more metadata fields are included. This documentation uses examples from AnimalCorpus_V2.json. Both files can be used in CLARK and contain the same “notes.” AnimalKeywords.json contains input for the Regular Expressions Library, and AnimalExpressions.json can be uploaded to the Active Regular Expressions tab in CLARK.
Generally, CLARK can be used in the following steps:
Form a classification question, then identify groups and group-defining features of interest through a literature review.
Select a training (“gold standard”) corpus and evaluation (unlabeled) corpus of clinical notes. Gather, process, and load clinical notes into CLARK on the Training Corpus page.
Format features (words, phrases, or values) as regular expressions to match text in the body of clinical notes.
Iteratively train and assess algorithms and combinations of features until satisfied with CLARK’s performance on the labeled training data.
Transfer algorithm of choice to the unlabeled evaluation corpus.
Review, interpret, and use classification results in context.
Saving a session preserves regular expressions in use, clinical notes in use, and current algorithm results. Sessions are saved as .zip files. Previously saved sessions can be loaded from the More menu or on the home screen. There is no need to un-zip a saved session to load it into CLARK again. The contents of the .zip file can be viewed locally, but the corpora and regex files are compressed into a single .json. Regular expressions created/updated in CLARK can be saved individually from the Training Corpus page. To save the results of several algorithms for one set of clinical notes, save the sessions separately.
When using CLARK to explore clinical data, be mindful that the saved session will contain identified protected health information (PHI). The session should be saved in a location approved at your institution for storing PHI.
Clinical notes contain free-text patient data ranging from family history to current symptoms and imaging results. A set of clinical notes is called a corpus. CLARK requires two different corpora for a complete classification: the “gold standard” training set with known patient classifications (labels), and the unlabeled “evaluation” set. Machine learning algorithms are trained first using the gold standard data, then applied to the unlabeled evaluation set for classification. For use in CLARK, these notes must be converted to a specific format that renders a whole set usable and searchable.
Creating Gold Standards
Researchers can build a gold standard corpus through manual chart review for a subset of the patient population, leaving the rest of the population unlabeled for the evaluation corpus. Each patient should have one distinct label, and patients of the same group should have identical labels. For example, CLARK recognizes “diabetic”, “diabetes”, and “T1D” as different groups; however, these terms indicate the same characteristic and should share a common label in the corpus.
Gold standard labels provide CLARK’s classification options for the evaluation corpus, so all possible groups should be represented in the training data. The size of a gold standard set depends on several factors: analysis goals, rarity of condition of interest, population size, and desired confidence level for classifications.
Clinical notes must be formatted as a .json file, in which a single note is enclosed in { } brackets, the whole corpus is enclosed in [ ] brackets, and fields are separated by commas. For clarity in this example, each field is on a separate line; this is not necessary in practice, as long as entries are properly separated with punctuation. The name and content of each field are enclosed in quotation marks, and a colon separates the field name from its content.
All fields other than “note” will populate CLARK as structured metadata. Here is a sample script for formatting clinical notes in R.
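As an illustration of the expected structure, here is a minimal Python sketch that assembles two notes into a corpus and writes the .json file. The field values are invented, following the Animal Corpus fields described in the Metadata section below:

```python
import json

# Each note is a dict ({ } brackets); the whole corpus is a list ([ ] brackets).
# MRN, noteID, noteDate, and note are required fields; label is required for
# a training corpus. All values here are invented for illustration.
corpus = [
    {
        "MRN": "Aardvark",
        "noteID": "1",
        "noteDate": "2016-11-13 19:45:00",
        "label": "mammal",
        "note": "Patient has fur and nurses its young.",
    },
    {
        "MRN": "Aardvark",
        "noteID": "2",
        "noteDate": "2016-11-14 08:30:00",
        "label": "mammal",
        "note": "Long snout noted on examination.",
    },
]

with open("corpus.json", "w") as f:
    json.dump(corpus, f, indent=2)  # indent is optional; compact JSON also works
```

Any script that emits this structure (the R sample above, or the Python sketch here) will produce a corpus CLARK can load.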
The Metadata header for notes in CLARK includes limited structured data for each subject in a corpus. Metadata is visible in the Note with additional markup pane. MRN, noteID, noteDate, and label are required. Other metadata fields, such as gender and age, are optional, and some notes may include these extra metadata fields while others do not. The Animal Corpus example includes the following fields:
Field | Description | Example | Required? |
---|---|---|---|
noteDate | The date and time of the clinical note. This distinguishes several patient notes from each other. | “2016-11-13 19:45:00” | Yes |
note | The body of a clinical note | free-text formatting | Yes |
noteID | Distinct note ID, possibly including several per MRN. | “1” | Yes |
label | Classification label for a patient. Each distinct label in a training corpus is a possible group for patients in the evaluation corpus. | “mammal” | Yes* |
MRN | Patient medical record number or other identifier | “Aardvark” | Yes |
noteCount | Counter of notes within a patient | “1” | No |
*Labels are required for the training corpus; the evaluation corpus is unlabeled.
The “note” field is the only non-metadata field in a corpus, and this is where CLARK uses NLP to search for matched features. Text in the note fields may be just that: a block of text. However, sections from clinical notes can be preserved through syntax in .json files. Since notes can be viewed within CLARK, sectioning may be worthwhile for readability. The following note has sections that are distinct in CLARK; these sections can be included or excluded because their headers are specified in the note and recognized by regular expressions.
A regular expression (regex) is a string of letters or numbers that uses additional special characters to define a search pattern in a body of text. The purpose of using regular expressions in CLARK is to robustly identify features (words, values, or phrases) in clinical notes that distinguish patient groups of interest from one another.
Writing regular expressions does not require any text-mining expertise and can be learned through online tutorials. Regex101.com provides a useful interface to practice using regular expressions, and RexEgg.com includes an in-depth tutorial to help develop more complicated expressions. Note that CLARK uses python “flavored” regex. This is generally the same as other regex flavors, and more information can be found here.
General Regular Expression Examples
RegEx | Purpose | Example | Matches |
---|---|---|---|
“(?i)” | Searches case-insensitive | “(?i)heart” | “HEART”,“Heart”,“heart” |
“\b” | Word boundary; anchors the start or end of a word or phrase | “\bhigh BP\b” | “high BP”, will not match “highest BP” |
“[abc]” | A single character a,b,c | “arm[sy]” | “arms”,“army” |
“\d” | Single digit 0-9 | “\d\d\d” | “456” or “333” |
“\D” | Anything not a digit | “\D\D\d” | “AB1” |
“.” | Any character except line break | “...” | “abc”, “a c” |
“*” | Zero or more times | “a*h*” | “ah” “aaaaahh” |
“\” | Some characters have special functions in regex. To search for the literal character as text, precede with a backslash. | “1\.5\+1\?” | “1.5+1?” |
More examples of regular expressions can be found in the AnimalKeywords.json and AnimalExpressions.json files that come with the Animal Corpus example. Note that in the .json files, the backslash should be duplicated any time it is used. For example, the word boundary “\b” should be entered as “\\b” in a .json file.
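The patterns above can be tried directly in Python, which shares CLARK’s regex flavor, and the backslash-doubling rule can be verified by parsing a small JSON string (the “high bp” entry here is invented for illustration):

```python
import json
import re

# Patterns from the table above, tried with Python's re module.
assert re.search(r"(?i)heart", "HEART disease")             # case-insensitive
assert re.search(r"\bhigh BP\b", "history of high BP")      # whole phrase matches
assert not re.search(r"\bhigh BP\b", "highest BP reading")  # embedded word does not
assert re.search(r"arm[sy]", "the army")                    # character class

# In a .json file, every backslash must be doubled: "\b" is written "\\b".
entry = json.loads('{"name": "high bp", "expr": "\\\\bhigh BP\\\\b"}')
assert entry["expr"] == r"\bhigh BP\b"
```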
Once notes are loaded into CLARK, sections are identified using the “Section Break Regex”, found under the Section Definitions tab on the Training Corpus page. This regular expression recognizes the user-specified headers in the note text. Altogether, it matches any number of consecutive #, followed by any number of consecutive characters that are not a space.
Instructions for including or excluding specified sections can be found here.
There are endless possible regular expressions that can match words and phrases in clinical notes. Below is a sample of features that could be used to classify some common conditions.
Clinical Regular Expression Examples
Name | RegEx | Matches |
---|---|---|
Obesity | (?i)\b(?<!not )obes(ity|e)\b | “Obese”,“obesity” |
Tobacco | (?i)\b(?<!non-)((smok(ing|er|es))|(tobacco)|(cigar[est]{0,5}))\b | “tobacco”, “smoking”, “smoker”, “cigars”, “cigarettes” |
Coughing | (?i)\b((cough[ings]{0,3})|(wheez[inges]{0,3}))\b | “cough”, “coughs”, “wheezing”, “wheezes” |
High Blood Pressure | (?i)\b((high\s(bp|blood pressure))|(hypertension))\b | “high BP”, “hypertension”, “high blood pressure” |
A CLARK analysis begins on the Training Corpus page. Here, the user loads data (notes and metadata) to be analyzed along with regular expressions.
What is a training corpus? The training corpus consists of “gold standard” data: a set of subjects’ clinical notes with their true labels included. Labels, or groups, are defined in context of the classification question. This set is used to build and assess an algorithm before attempting to classify patients of unknown groups. Training an algorithm is an iterative process, taking several rounds of feature selection and model performance evaluation. The Animal Corpus example that comes with CLARK includes gold standard data for animals of 5 classes: mammal, bird, fish, insect, and reptile. The features selected for this classification (defined in the example regular expressions files) help to distinguish the 5 groups.
The training corpus can be loaded on CLARK’s home screen by selecting “Load a Corpus.” A file explorer pops up, then users may navigate to the location of the training corpus and double-click on a .json file to upload it. A green check appears once the notes are successfully loaded, and then they appear in the “Note with Additional Markup” pane.
If the training corpus or regular expressions files are not correctly formatted, CLARK will present this error: “Failed to Load Regular Expression Library”. To fix it, pay close attention to the structure of the example animal corpus materials and check that the following criteria in the .json files are satisfied:
Here, users can upload a different training corpus without restarting a CLARK session from the home screen. As on the home screen, a file explorer pops up for selecting a corpus .json file.
The RegEx Library stores regular expressions to be used in different classifications. These may be included or suppressed throughout the feature selection process. Regexes may be loaded from a file or added directly from the CLARK interface. CLARK accepts a .json file, and its structure differs slightly from those uploaded to the Active Regular Expressions tab.
Similarly to entries in the RegEx Library, Active RegEx can be added, deleted, edited, arranged, and exported locally. The “X” icon deletes an active feature, and it will no longer be used by the algorithm or highlighted in clinical notes. If a deleted active feature calls on a feature in the RegEx Library, it still appears in the Library. There are three ways a user can add regular expressions to Active Regular Expressions.
1. Load a .json file containing regular expressions
To load a .json file into Active Regular Expressions, click on the folder icon in the Active Regular Expressions tab under Algorithm Setup. An “Active RegEx.json” file can include new regular expressions or reference the RegEx Library. The regular expression and the name must be on different lines, and each must be specified with “expr” or “name”. The “expr” line includes the regular expression with special characters, and the “name” line includes the label for that feature.
2. Enter regex directly into CLARK interface
3. Add from Regular Expression Library
Differences between Active RegEx and the RegEx Library:
The .json file uploaded to the Library has one line per feature. To include these features in analysis, the Active Regex file must reference the desired features by name. To exclude an active regular expression from analysis, it must be deleted from the Active RegEx pane. To exclude a feature that is defined in the Library, it can remain in the library pane as long as it is not called by “#feature” in the Active RegEx pane.
Entire subjects or standalone clinical notes cannot be excluded once loaded into CLARK. However, sections of a clinical note can be excluded across the corpus. This is done using regular expressions in the Section Definitions tab. Sections, if included, are separated by headers in the free-text clinical notes within the corpus. They should already be defined in the corpus.json file before uploading to CLARK. This portion of the documentation explains how to make CLARK recognize and select sections. Instructions for defining sections in clinical notes can be found here.
By default, all sections of a note are searched for features to be used in classification. To match a section with regex that can be included or excluded, add a row to the Section Definitions tab using the “+” button. Type the exact header of the section in the REG. EXP box between word boundaries, “\b” and “\b”. The section specified will be highlighted in red if not in use, or blue if in use. To toggle section usage, double-click on the box under IN USE and select from the drop-down list. Sections excluded from the training corpus will also be excluded from the evaluation corpus.
When to Use Sectioning
If users expect certain information is irrelevant to classification, sectioning can come in handy. For example, conservation efforts for animals may not tell us anything new about how they are classified. In clinical notes, general family history might not reveal anything new about a condition among all the other information provided. Excluding sections of clinical notes can also speed up the algorithm when there is a large volume of notes or patients to process.
This pane includes three tabs, allowing users to explore notes with different views:
Note: Before classification algorithms can be selected and tested on the training corpus, the Active Regular Expressions section in the Features tab must be populated.
Note: Before loading the evaluation corpus into CLARK, explore results from the gold standard, or training, data. Once satisfied with a model’s performance on the training set, it’s time to apply the algorithm to an evaluation corpus.
CLARK employs classification algorithms created by SciKit Learn, a resource for machine learning in Python. A detailed description of each algorithm can be found in the technical appendix and at SciKit Learn.org.
The Linear SVM algorithm uses features to ‘draw’ lines between data points to separate them into classes. In CLARK, patients are grouped based on their true label in the training corpus, and their features are associated with these groupings for use in the evaluation corpus.
The Gaussian Naive Bayes classifier uses the probability of each feature belonging to a group to predict the most likely label for each patient in the evaluation corpus. This algorithm assumes that features are independent of each other: that the presence of one feature does not imply the presence or absence of another.
The Decision Tree algorithm classifies patients iteratively into subgroups based on the presence of features. The first “split” in a decision tree uses the most polarizing feature to distinguish into subgroups, then each subgroup is split by the most distinguishing feature, and so on until each subject is labeled. For example, a polarizing feature when classifying animals would be “feathers”. A decision tree would split the animal corpus into two groups: those with matches to the “feathers” regular expression and those without.
The Random Forest algorithm employs the use of many decision trees to classify items into groups. Random samples are taken from the population, and subjects within each sample go through a decision tree with a random sample of features. The same subject appears in several different decision trees, and the algorithm selects its label with the most “votes” from all the decision trees. Decision trees within a random forest do not split samples based on the most polarizing feature, but rather a random feature at each split. This creates diversity between the trees and decreases dependence on a single important feature.
The Explore page is accessible after running cross-validation on a training set, or after an algorithm is transferred to the evaluation corpus. Trying several different algorithms and cross-validation methods in the training corpus will help in selecting a model to classify subjects in the unlabeled evaluation corpus. The explore page contains information about classification accuracy, CLARK’s confidence in its labels, and the distribution of labels among the study population.
The user may also select the group labeled as birds from the chart on the right to see how many truly are birds or of another classification. This investigation is helpful in both directions. If 45 animals in the training corpus are truly birds, and 45 are classified as birds, some may still be mislabeled.
When interpreting results grouped by confidence, it is important to remember that CLARK should not perfectly label each subject in the training corpus; CLARK is meant to supplement human effort and help to decrease time spent on manual chart reviews. However, it may be wise to determine a confidence level at which patient notes will be reviewed manually.
When is it time to use the Evaluation Corpus?
After several rounds of training, users may gain a sense of which performance is relatively ‘good’ based on confidence estimates and the proportion of correctly classified subjects in the training data. Given a thorough literature review, researchers may pre-specify the proportion of correctly classified subjects and corresponding confidence levels necessary to move from training to evaluation.
The Filtered Records dataset lists individual subjects who fit into the selected label or confidence range selected in the graphics above. If no confidence ranges or classifications are selected from the above section, all subjects are listed. Clicking on a cell in any subject’s row brings up a window that displays the highlighted features in their clinical notes; this may help indicate why CLARK grouped them correctly or not. Clicking on any column header toggles sorting in ascending/descending order.
Navigating Filtered Records
Where Misclassified=“No,” the True Label and Classifier Label columns will have the same value. Max Conf contains the maximum confidence level CLARK computed for a subject’s label. Each label’s column contains CLARK’s confidence level that a given subject belongs in that group, and the largest confidence value in a row will populate the Max Conf column. This corresponds to the label in Classifier Label.
For example, CLARK was 20% confident that the finch is a fish, and 80% confident that the finch is a bird, so it was correctly classified as a bird. However, CLARK was 78.89% confident that an anglerfish is a mammal, so it was misclassified. Finding misclassified subjects at different confidence levels can reveal if features or the algorithm need to be updated.
When updating a model or list of included features, be careful of overfitting. CLARK is not intended to perfectly classify the training set.
In this example, CLARK has a confidence of less than 50% for 29/114 unlabeled subjects. This does not necessarily mean that they are all misclassified, but relatively low confidence levels may warrant a manual review or algorithm adjustment.
By clicking “Export Data” in the upper-left corner of the CLARK environment, the dataset shown in Filtered Records can be saved locally as a .csv. These files can be opened in Excel or other software for more detailed evaluations of the classification model’s performance.
When using CLARK to explore clinical data, be mindful that the saved session will contain identified protected health information (PHI). The session should be saved in a location approved at your institution for storing PHI.
CLARK offers stratified and random cross-validation to use on the training corpus and help prevent overfitting.
CLARK Under the Hood
Before employing a chosen machine learning classifier, CLARK transforms clinical notes into a multi-dimensional feature vector based on the number of user-defined features (recall that a regular expression can match several different words). The number of matches for each regular expression is calculated for each sentence, then the vector of match counts is summed across all sentences in a note. The vectors are then summarized by patient by calculating the mean feature vector across all of the patient’s notes (this is why metadata fields in clinical notes are important). Once the features have been summarized for each patient, the selected algorithm uses this patient-level information to train a model.
Decision trees classify subjects based on the presence of certain features. Classification begins at the ‘root node’ containing the whole population, and points are classified stepwise by features. The algorithm makes a distinction at a ‘split’ for each feature. The first split employs the most polarizing feature of the sample, and further splits are then the next-most-distinguishing features of the resulting sub-groups at ‘child nodes’.
The goal of each split is to create subsets as different from each other as possible, with the subjects in each resulting subgroup as similar as possible. The ‘leaf’ at the end of a decision tree run is the resulting classification, with each point in the population classified in a group. The decision tree includes every subject and every feature specified in the algorithm. In cross-validation, the subsamples’ trees include every feature but not every subject.
The probabilities returned by SciKit Learn decision trees are the number of training observations of a given class in a leaf divided by the total number of training observations in that leaf.
Decision trees are susceptible to overfitting when too many features are specified in the training set. To help mitigate this in CLARK, consider dimensionality reduction and selecting subsets of RegEx-specified features. Maximum depth is the greatest depth to which the tree is allowed to grow. CLARK does not specify a max depth; by default, nodes are expanded until the leaves are purely separated or all leaves contain fewer than the minimum number of samples required to split.
The random forest is an ensemble method using several decision trees in classification. The decision trees created in a random forest differ in a few ways from the decision trees described above. Rather than using the most polarizing features at each split, the algorithm selects a subset of features at random for each tree. This creates diversity between the trees. Additionally, random bootstrap samples of the training population are used in the random forest, allowing the same patient to appear in several different trees. Bootstrap aggregation also contributes to model diversity.
As each subject appears in several different decision trees, one may be assigned more than one classification in the entire forest. The model’s decision for a subject’s classification is the leaf in which the subject appears most often, or the classification with the most ‘votes’ from the forest. The appeal of a random forest is the uncorrelated nature of the individual trees within. Bootstrap aggregation and random feature selection contribute to this and ensure that the trees do not err in the same direction. For classifications to be accurately predicted, the selected features (words and values found by regex) do need to hold some predictive power. Similarly to the decision tree algorithm from SciKit Learn, the random forest in CLARK uses default attributes for maximum depth.
The Linear Support Vector Machine (SVM) defines a hyperplane in n-dimensional space to separate data points (in the context of CLARK, patients) into different groups. The SVM algorithm ‘draws’ an (n-1)-dimensional hyperplane between points with the maximum margin between groups in the training set. For example, with two features, a 1-dimensional line is drawn to separate the groups. CLARK uses SciKit Learn’s SVM classifier with a linear kernel and probability estimates enabled.
The linear SVM algorithm used in CLARK employs nominal features, namely the words found by regex in clinical notes. Dummy variables represent each feature: the presence or absence in patient notes of a word or value matching a user-defined regex. These features determine where a patient lies in the feature space; each patient in the training set has a known label, and the algorithm attempts to group patients with the same label together based on the presence of features in their clinical notes.
After training, the algorithm applies the hyperplane boundaries defined on the training set to the evaluation corpus, so unlabeled patients are classified by support vectors based on where they lie in n-dimensional space, where n=number of features and their position is dictated by the presence of these features. Model parameters specify priorities of the SVM. There is a trade-off between perfect partition of points in the training set and smoothness of a line.
The Gaussian Naive Bayes classifier assumes that the likelihood of each feature appearing in a clinical note follows a normal (Gaussian) distribution, hence “Gaussian” in the algorithm’s name; “Naive” refers to the assumption that features are independent of one another. Given a true label, the algorithm calculates the probability of each feature being associated with that label. The algorithm then infers subject classifications by applying Bayes’ theorem (hence the rest of the name).
For example, the probability of an unlabeled animal being a bird, given that it has wings, is derived from the probability that an animal known to be a bird has wings mentioned in its “clinical notes”. The goal is to find the best classification given data, or the most likely patient label given the features found in clinical notes.
Naive Bayes is robust to several possible classifications (more than two labels to distinguish) in a population. CLARK employs the default settings of Gaussian Naive Bayes from SciKit Learn.
Occasionally, CLARK will throw an error upon uploading corpora or regular expression files. First, be sure that all files going into CLARK are saved as .json. If the files still are not being loaded properly, it is most likely a formatting issue.
If features are not being highlighted in CLARK as expected, there is most likely a problem with some regular expressions.
In small sample sizes, or in the case of some rare labels among the training corpus,