Emily Pfaff, PhD is Co-director of Informatics and Data Science at NC TraCS. She has a PhD in Health Informatics and a master's in Information Science, both from UNC. Her primary expertise and research interests are in "computable phenotyping" and clinical data modeling in support of translational research.
via @patrick_schneider on unsplash

InsideTraCS — with Emily Pfaff

By Marla Broadfoot

InsideTraCS: Get to know your extended research team through a new series featuring conversations with faculty and staff.

Emily Pfaff, PhD

Marla Broadfoot, NC TraCS science writer, recently spoke with Pfaff about improving and refining health data collection and analysis, COVID research, and the role of informatics in solving today's pressing health problems.


Last September, you finished your PhD in Health Informatics through the Carolina Health Informatics Program (CHIP). Congratulations, by the way! What drew you to health informatics?

I graduated from undergrad with a degree in Russian history—and you won't be shocked to hear that it was pretty hard to find a job in that field during a recession. When I decided to change course and go back to school, I knew I was interested in information science but was somewhat lacking in direction. Luckily, I ended up getting a graduate research associate position in Psychiatry helping to build study databases, and I was hooked; I loved the idea of using my computing skills to actually impact patient health. An internship at TraCS turned into a position at TraCS, which turned into … ten years later and a PhD. And I still love the idea that computing can impact health! So in a way, health informatics found me rather than the other way around.

What was your dissertation on, in layman's terms? Are you continuing this work in any way?

My dissertation focused on methods of "computable phenotyping," or how one translates inclusion/exclusion criteria into code in order to define a patient cohort, or group of study subjects. I experimented with a new method to help ensure that the code-based version of the cohort definition gets as close as possible to the definition of that cohort in the clinician's mind. As computable phenotyping is so fundamental to the field of clinical informatics, I am definitely continuing this work!
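As a rough illustration of what translating inclusion/exclusion criteria into code can look like, here is a minimal Python sketch. It is not drawn from Pfaff's dissertation or from any TraCS system: the `Patient` record, the age threshold, the diagnosis-code prefixes, and the sample patients are all hypothetical, chosen only to show the general shape of a cohort definition.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Patient:
    """Hypothetical, heavily simplified stand-in for an EHR patient record."""
    patient_id: str
    age: int
    diagnosis_codes: Set[str] = field(default_factory=set)


def has_code_with_prefix(patient: Patient, prefix: str) -> bool:
    """True if any of the patient's diagnosis codes starts with the prefix."""
    return any(code.startswith(prefix) for code in patient.diagnosis_codes)


def in_cohort(patient: Patient) -> bool:
    """Apply illustrative (not clinically validated) criteria to one patient."""
    # Inclusion (assumed): adult with at least one code under prefix "E11".
    included = patient.age >= 18 and has_code_with_prefix(patient, "E11")
    # Exclusion (assumed): any code under prefix "O24".
    excluded = has_code_with_prefix(patient, "O24")
    return included and not excluded


def build_cohort(patients: List[Patient]) -> List[str]:
    """Return the IDs of patients who satisfy the cohort definition."""
    return [p.patient_id for p in patients if in_cohort(p)]


# Made-up sample data, one patient per rule being exercised.
patients = [
    Patient("p1", 54, {"E11.9"}),           # meets inclusion, no exclusion
    Patient("p2", 16, {"E11.9"}),           # fails the age criterion
    Patient("p3", 33, {"E11.9", "O24.4"}),  # hits the exclusion criterion
    Patient("p4", 61, {"I10"}),             # lacks the inclusion diagnosis
]

cohort = build_cohort(patients)
print(cohort)
```

Even in a toy like this, every choice (which code prefixes count, where the age cutoff sits, whether one matching code is enough) is a place where the coded definition can drift from what the clinician has in mind, which is the gap the dissertation work aims to narrow.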

How is your role at TraCS changing, if at all, now that you have a PhD?

Somewhat different, somewhat the same. I co-direct the Informatics and Data Science service with Ashok Krishnamurthy, along with Associate Director Kellie Walters. We approach most things as a team (and have a simply awesome team of analysts, developers, and project managers who do so much of the heavy lifting), so in a way things are operating much as they always did. But I will have some new opportunities to lead projects as a PI, which opens some doors. I'm excited to see what comes next.

I see that you are involved in the National COVID Cohort Initiative. Could you give me a sense of the scale of the project and what it is hoping to accomplish?

[Figure: N3C data flow]

N3C is an NIH-funded initiative to bring together massive amounts of electronic health record data for patients across the US, and then use those data (and that huge sample size) to answer important COVID research questions. Any researcher can request approval to use the data, which makes it a wonderfully open and democratic resource for doing data-driven COVID research. We currently have about 10 billion rows of data, for almost 9 million patients. Many aspects of working on N3C, especially all the wonderful new colleagues I've met, have made it one of the highlights of my career, but at the same time, the circumstances that led to its instantiation are tragic. I always strive to remember that the clinical data we are entrusted with are not just numbers and strings; those data represent real people, many of whom lost their lives in the pandemic. It's important to maintain that perspective.

One of the great mysteries of the COVID-19 pandemic has been the persistence of long-term symptoms in a significant proportion of people who tested positive for SARS-CoV-2. You wrote that "long COVID holds the potential to produce a second public health crisis on the heels of the pandemic itself," arguing that the characteristics of this new condition need to be carefully defined in order for it to be studied effectively. Could you explain how long COVID is a health informatics/data science problem?

This gets back to the concept of "computable phenotyping." In order to identify patients with a disease of interest, you need to create a definition for that cohort. If you need a large sample size of patients with that disease (thousands, or more), the most efficient method to find those patients is to convert your definition into code in order to find qualifying patients using their electronic health record data. Long COVID is no different; in order to study long COVID, we need to identify patients who have long COVID. Sounds obvious, but long COVID is a particularly tricky condition to define (so far), because its symptoms are wide-ranging, with lots of overlap with non-long COVID conditions (think anxiety, fatigue, and insomnia). Thus, finding a way to differentiate that cohort of patients using nothing but their health record data is a data science puzzle that is still very much unsolved.

What other pressing problems (pandemic related or not) do you think health informatics can solve?

Working on N3C has taught me that amazing things are possible when the informatics community comes together and agrees on data standards, and then implements those standards in their local data holdings. That may not sound earth-shattering, but think of it this way: if every US health care system structured their data in the same way, and enabled secure and compliant ways to share those data with one another, the boundaries of a single institution would no longer limit the scope of data-driven clinical research. What could we accomplish if we could ask questions of the data for every Alzheimer's patient in the US? How would rare disease research be impacted if low sample size was less of a problem? The informatics community is absolutely working on this now, and major strides have been made already, but we can and should go further. I'd be hard-pressed to say that informatics will cure cancer, invent the next new therapeutics, or prevent the next pandemic, but it will absolutely serve as a critical support for the humans accomplishing those feats.

What else are you working on that you think we should know about?

Some of my most important work is tracking down the next big project to keep our brilliant team on their toes. We have lots of new things on the horizon: working with social determinants of health data, text mining, and building more tools to help UNC investigators work with clinical data. Stay tuned!

NC TraCS is the integrated hub of the NIH Clinical and Translational Science Awards (CTSA) Program at the University of North Carolina at Chapel Hill. It combines the research strengths, resources, and opportunities of the UNC-Chapel Hill campus with those of partner institutions RTI International in Research Triangle Park, North Carolina Agricultural and Technical State University in Greensboro, and North Carolina State University in Raleigh.
