Data Matters: Creating the Perfect Health Research Cohort

Contributed Commentary by Dr. Gai Elhanan, MD, MA, Data Research Scientist, Center for Genomic Medicine, Desert Research Institute 

May 12, 2023 | Back in the summer of 2020, Northern Nevada residents were experiencing stifling smoke from the Western US wildfires while COVID infections were spreading rapidly. Research scientists at my organization wanted to understand the human health effects of large-scale wildfire smoke, particularly in relation to COVID infections. Using a cohort built from the clinical data repository of a large regional medical center (Renown Health) and existing population datasets, we found an association between the wildfire smoke event and an almost 18 percent increase in COVID infections.

Research studies and findings like these can help people make informed decisions, take preventive measures, or seek treatment earlier, ultimately improving health and possibly saving lives. But such actions aren’t possible without truly repeatable and defensible cohorts. Cohort creation is a foundation of research: many critical decisions depend on how the cohort is designed and built. Better cohort creation can lead to more accurate findings and better care outcomes. But how do you build a better cohort? It all starts with the research question and the data at hand.

Strong data: the linchpin of a good cohort 

In the healthcare sector, we have become very good at collecting large amounts of data, most of it in discrete form and readily accessible in electronic format. Such data are central to defining and creating any research cohort. As researchers, we believe that the quality, relevance, and accuracy of those data are paramount to creating a useful and usable study cohort; they are the difference between valuable research outcomes and vague, unreliable results.

The reality is that healthcare researchers and regulatory agencies need a more comprehensive view of individuals and populations to make confident decisions about healthcare and policy. Clinical trial data, electronic health records (EHR), claims data, and adverse-event reports are only snapshots of patients at random points in time.

To provide real value and achieve the greatest effect on health and wellness, researchers need a holistic patient profile. By better understanding how genetics, environment, social factors, and healthcare interact, we can help predict who may be at greater risk, allowing for quicker diagnoses and the development of more precise treatments.

Data collection: effectively tapping the EHR

While data feed the creation of cohorts, collecting them should not be a write-once-then-forget effort. Determining and collecting the data is a task for all healthcare constituents, including patients. Drawing a true picture of a patient’s health status from those data is an ongoing commitment, particularly when it comes to environmental, social, and economic determinant data. Healthcare providers must understand the value of such factors in assessing long-term outcomes for both the patient and the community. Research use of those data requires judgment as well, given the reluctance of some patients to share such information in the wake of highly publicized data breaches.

There’s a role for EHR vendors as well. Patients report specific risk factors and predispositions, and there should be a standardized way to incorporate such reporting in the EHR. Despite standard ontologies and clinical data exchange standards, every product is different, and every implementation is different. Vendors are protective of their intellectual property and not necessarily open to sharing it in the interest of collaboration beyond the basic, mandatory requirements. Nonetheless, cohort development and studies must be system-agnostic, and EHR systems must provide access to the data they hold regardless of the client program.

Data mining: insights to improve care outcomes 

Fine-tuning a research cohort can be challenging despite our access to large amounts of data. Retrieving the right data from an EHR and generating knowledge from it is a complex process that differs for each system. Designing the cohort and choosing qualified patients for a study is likewise complex and tedious.

At the Healthy Nevada Project, one of the largest community-based population health studies in the US and part of the Center for Genomic Medicine at the Desert Research Institute, we’ve collected DNA samples from more than 50,000 participants. For 80% of those, we also have comprehensive clinical data. Integrating large volumes of patient data from a variety of disparate sources is an ongoing commitment. It relies heavily on the quality of the data mined from partner systems and may involve machine learning (ML) and artificial intelligence (AI) methodologies for both the integration and the analytical components.

AI and ML handle complexity and tedium simultaneously, on a scale humans can’t hope to reach. We rely on advanced analytics from our partner SAS to drive the process of drawing value from our large clinical data repository. Mining these datasets not only gives us a picture of the health of the cohort but also shows us how to build the right cohort for current and future research projects.

One thing is certain about data: There will always be more of it. There will be new sources to integrate, new mechanisms to collect them, and new technologies to store and analyze them that will bring us ever closer to real-time processing. We will be able to define cohorts better and to identify and recruit patients faster and more efficiently. With deeper analysis and more comprehensive reporting, we will approach the goal of truly personalized treatment.


A veteran physician (Internal Medicine and Infectious Diseases), Gai Elhanan also has more than 25 years of experience with healthcare information systems, including research, design, development, and implementation in clinical and administrative environments. Additionally, he has a formal medical informatics education with a broad informatics skill set. Dr. Elhanan has unique knowledge in the field of semantic networks and medical/healthcare ontologies. He has considerable experience in healthcare data analytics combining environmental and genomic data, healthcare informatics, and medical terminologies. Gai received his M.D. from Tel Aviv University and his M.A. in Medical Informatics from Columbia University. He also completed an NIH post-doctoral fellowship at the Medical Informatics Department, New York Presbyterian Medical Center/Columbia University. He can be reached at