Long Covid, with its constellation of symptoms, is proving a challenging moving target for researchers trying to conduct large studies of the syndrome. As they take aim, they’re debating how to responsibly use growing piles of real-world data — drawing from the full experiences of long Covid patients, not just their participation in stewarded clinical trials.

“People have to really think carefully about what does this mean,” said Zack Strasser, an internist at Massachusetts General Hospital who has used existing patient records to study the characteristics of long Covid. “Is this true? Is this not some artifact that’s just happening because of the people that we’re looking at within the electronic health record? Because there are biases.”

One of the largest sources of real-world data on long Covid is a first-of-its-kind centralized federal database of electronic health records called the National Covid Cohort Collaborative, or N3C. Kickstarted as part of a $25 million National Institutes of Health award early in the pandemic, N3C now includes deidentified patient data from 72 sites around the country, representing 13 million patients and nearly 5 million Covid cases.


“If we are able to identify these sort of constellations of symptoms that make up these potential long Covid subtypes then, first of all, we might find out that long Covid is not one disease, but it’s five diseases or 10 diseases,” said Emily Pfaff, who co-leads the long Covid working group at N3C. The real-world data effort has garnered additional funding as part of RECOVER, the four-year NIH initiative to study long Covid, to more precisely characterize the syndrome.

That work has started to trace a clearer image of long Covid, most recently describing co-occurring clusters of cardiopulmonary, neurological, and metabolic diagnoses. But a firmer definition of the syndrome could also potentially support recruitment efforts for critical long Covid trials, some of which have been slow to make progress.


“There’s a concern that trials relating to long Covid are going to not be that successful,” said Melissa Haendel, a health informatics researcher at the University of Colorado Anschutz Medical Campus and co-lead of N3C, because the syndrome’s definition is still so diffuse.

Supporting more targeted recruitment is what Pfaff calls the project’s “sweet spot.” She and her colleagues hope that machine learning models could help identify potential participants who would otherwise be missed or underrepresented in prospective research. And by using algorithmic approaches to narrow down a cohort of people who are more likely to have long Covid, said Pfaff, “a research coordinator who’s making calls to potential participants is making calls from a list of 200 patients, rather than 2 million patients.”

That effort is still a work in progress. The team’s first stab at building an algorithm that could identify long Covid patients, released in a preprint now accepted at the Lancet Digital Health, had its limitations. At that point, “there was literally no structured way for a physician to enter ‘I think this patient has long Covid’ in their EHR,” said Pfaff. “We had to get creative and find a proxy.” They settled on records from about 500 patients who showed up at three long Covid specialty clinics.

The model performed decently when tested on records from a fourth clinic, differentiating between long Covid clinic patients and non-patients with a 0.82 area under the curve, a measure machine learning researchers use to gauge how well a model separates two groups (0.5 is no better than chance; 1.0 is perfect). But it was still based on a small number of patients who could be demographically skewed. And Pfaff pointed out the data might overrepresent long Covid patients with respiratory symptoms, because two of the clinics used for model training were based in pulmonary departments.

Since that round of work, medicine has gained better awareness, if not necessarily a better understanding, of long Covid. In October, providers were finally able to track long Covid patients with a dedicated diagnostic code that “will be very important for recruitment,” said Lorna Thorpe, a co-investigator for RECOVER’s Clinical Science Core at NYU Langone Health. It can both provide a straightforward way to identify long Covid patients (there are 16,000 with the code in N3C so far) and help to develop a clearer definition of the syndrome.

“Eventually, the idea is to characterize the subtypes of long Covid that health care providers should expect to see in their clinics,” said Charisse Madlock-Brown, a health informatician at the University of Tennessee Health Science Center and co-lead for N3C’s social determinants of health team.

But the code could also be used to refine the next generation of N3C’s models, by teaching algorithms what to look for in electronic health records that could suggest a patient has long Covid — even if the code isn’t used.

“So much of getting a diagnosis of long Covid appears to have a lot to do with your access to care, as well as finding a doctor who even knows what long Covid is and is able to treat you,” said Pfaff. An algorithmic approach to recruitment could potentially help include patients who don’t have that access.

So now, the team is training models that learn from both clinic patients and those whose doctors have checked off the new diagnostic code, in the hopes of defining a “best of breed” classifier. When the group applied the latest version to N3C’s records, it turned up 158,000 potential long Covid patients, Pfaff said.

That’s not to say the model can or should be turned to patient recruitment immediately. Researchers both within N3C and the larger RECOVER initiative emphasize that algorithmic approaches are no silver bullet, and they’ll always need to be used in combination with human vetting to build study cohorts.

That’s because any skews in the data used to train a long Covid model could result in inaccurate predictions. And while N3C’s records have been cleaned up so they’re ready for analysis, “there are caveats to these data,” said Leonie Misquitta, whose clinical innovation team at the NIH’s National Center for Advancing Translational Sciences stewards the data platform. There are almost twice as many female patients with long Covid codes in the system as male patients, which could be a result of patient behaviors, coding practices, biological realities, or all of the above. In a more egregious example, a clustering algorithm initially identified sexual activity as a comorbidity of long Covid because of the way one site documented its patients.

“I think this is an important approach. I’m super supportive of it, and we’re communicating that to NIH,” said Thorpe. “But it won’t be the perfect solution. Let’s be realistic. Recruitment’s going to increase, it’s going to get incrementally better, with all the different strategies that are applied.”

The N3C team will continue refining their models as more real-world data emerges. In particular, they’re interested in building a machine learning classifier that could identify long Covid patients with subtypes of the disease, like those suffering from new-onset diabetes or certain types of kidney disease. “It may be easier to find people with the more common phenotypes,” said Jasmin Divers, another leader for RECOVER’s real-world data efforts at NYU Langone. “But if you wanted to fill a specific subset that you’re not seeing as often, then having that enriched pool to pull and recruit from could be beneficial.”

And critically, they’ll aim to test their predictions on new datasets as they roll in, seeing whether the results hold up across different health systems. “In medicine, the stakes are always high,” said Strasser. “I always err on the side of making sure things work correctly before and that things are really validated before we go ahead with using a technology like this.”

But while they acknowledge the limitations of real-world datasets and the algorithms trained on them, N3C researchers argue that using such models to identify trial cohorts is relatively low risk. “If somebody from a university were to be running a long Covid trial and asked me if I felt comfortable applying this model to help them make a potential recruitment list,” said Pfaff, “I would unequivocally say yes.” They could present certain recruitment sites with lists to follow up with, using a third party intermediary to protect personally identifiable information, or give them the code to run on their records internally to identify potential participants.

N3C leaders said the platform has been primed to support recruitment. Integrating the group’s EHR resources with clinical cohort identification was part of N3C’s initial proposals for RECOVER funding, but so far the NIH hasn’t funded that use of the tool. “The sort of framing initially of the work of the EHR cohorts was more a rapid strike: Let’s understand [post-acute sequelae of SARS-CoV-2 infection], let’s characterize it. It wasn’t in their contract with the NIH to do that,” said Thorpe.

“We have to wait for NIH to say yes, these are the things that we want you to prioritize and here’s the budget for those things,” said Haendel. “The recruitment sites and the data engineering team and N3C are ready to do such things, but there have to be resources and coordination.”
