What You Don’t Know Will Hurt You: Overcoming “Missingness” in Healthcare Data

Contributed Commentary by Ryan Leurck

August 6, 2021 | The increasing availability of data stored in electronic health records (EHRs) and other data sources, such as claims clearinghouses, brings substantial opportunities for advancing patient care and population health. However, the potential for rich analytics is fundamentally dependent on the completeness and quality of the data. Unfortunately, most data sets, especially in healthcare, are missing data and therefore may not be sufficiently representative to support the broader conclusions being drawn about groups of patients. Data scientists refer to this problem of missing values as “missingness.”

Even when data isn’t literally missing, its quality is often so poor that it is unusable and is functionally treated as missing. This missingness often leads directly to poor analytics outcomes.

In one study, researchers sought to use EHR data to populate a risk prediction model for identifying patients with undiagnosed type 2 diabetes mellitus but found up to 90% missing data in some records. Attempts at imputing these missing data or removing incomplete records resulted in a major deterioration in the performance of the prediction model. Further, researchers detected a significant drop in performance even in cases where only one-third of records were missing or incomplete, highlighting the substantial opportunities wasted because of missing data.

Good data is critical to good decision making, especially when it comes to using data for machine learning and business analytics. Traditional data mining approaches used to extract insights are designed to operate on complete data, which is why many life sciences and healthcare organizations make important decisions based on flawed analytical models trained on incorrect and missing data. Quantifying and understanding the characteristics of missingness in one’s data is an essential first step toward high-quality analytics and more accurate outcomes.

How Pervasive is Missingness?

In a word: very. Data is now used to support decision-making in every aspect of drug development and commercialization. Data missingness has the potential to affect nearly every decision and is one reason why only one in 5,000 drugs ever makes it to market.

When companies purchase data sets, such as claims data or script data (at great expense), they intend to use them as real-world evidence that accurately represents market activity. However, fields are often missing or incorrect. Missing data is caused by a variety of factors: human error, data processing errors, data collection errors, data sourcing errors, data use rights omission, and intentional suppression of data. Further, data formats in life sciences are often inconsistent, so some data sets are too hard to integrate and are therefore left out of consideration. Statistical missingness or “data asymmetry” that results from huge numbers of data points being distilled into even more data also introduces biases. Poor data quality and insufficient quantity compound the problem to produce a level of missingness that causes significant challenges.

Technology has been slow to catch up. Historically, clinical trials, claims reimbursement, and customer relationship management were all paper-based processes, followed by some standardization, and now many systems are electronic. The transition has been messy, decentralized, and ongoing, and the result is a lot of data of low quality.

Biases and missingness in common life sciences and healthcare data sets stem primarily from the underlying collection mechanisms themselves, even in the newer electronic systems. For example, claims and prescription drug data originate at the tens of thousands of healthcare providers and pharmacies in the U.S., each submitting electronic claims using one or more of a variety of billing and practice management software platforms. Any error or inconsistency in the configured settings at any practice or pharmacy can lead directly to systematic error and bias in the downstream integrated data sets.

Furthermore, the hundreds of thousands of billing and coding specialists operating these systems must learn to navigate the complexity and ever-changing nature of medical and pharmaceutical billing conventions and regulations. They must take great care to accurately input dozens of data points. The opportunities for human error are widespread, and any such errors further degrade the quality and utility of the final data to be analyzed.

Finally, HIPAA and other privacy-related regulations can have a very real impact on the quality and consistency of data sets. Any stakeholders in the “chain of custody” of the data from origination and collection, through to integration, packaging, and delivery may purposefully obscure certain data or remove records completely to minimize their perceived exposure to compliance and privacy risks.

All of these sources of bias conspire to make high-value decision support much harder to obtain. Because data vendors capture so many pieces of data and slice them into more and more slivers of information, the missing information has radical implications for the level of bias built into the resulting decisions, and the decision-makers don’t even know why. Companies spend millions of dollars on data but may only receive 20% of the value due to missingness. This can translate into billions in missed opportunities.

What’s the Rx for Missingness?

One obvious but somewhat naïve solution to this problem is simply to purchase more data. Many companies do, but the cost can be prohibitive and there are always limits to what can be obtained. Making better use of the data one already has is often the only viable option. And rather than waste their investments in big data, companies are building expensive real-world data processing systems: one pharmaceutical company spent $80 million on a new platform to deal specifically with data cleaning and preparation.

Unfortunately, the added data cleaning and preparation work steals precious time and resources and distracts pharmaceutical companies from life-saving drug development. In fact, compensating for missing data is one of the most time-consuming parts of a data scientist’s job in the life sciences and healthcare industries. Typically, a data scientist spends upward of 80% of their time in the data preparation and feature generation phases of the data analysis and machine learning processes.

In addition, all the data preparation delays the ultimate goal of obtaining business value from the data, whether that means generating physician preferences for commercial teams, insights that inform market access decisions, or patient recruitment for clinical trials. In fact, over the last two years, increased data complexity has driven data cleaning cycle times up 40%, ultimately contributing to delayed clinical studies.

Fortunately, the technological gap is closing. New AI-driven solutions attack the two primary issues: 1) minimizing the amount of “missing” data by repairing and remastering records with errors; and 2) understanding what is missing so that missing data can be accounted for in analyses and model building.

Rather than start with raw data sources and their inherent problems, such as missingness, advanced data science platforms correct known classes of systemic errors and use machine learning and other methods to fill in missing values with confidence and to standardize messy data. For example, when a data point is missing from a patient’s lab results during her hospital stay, machine learning models rank other patients by how similar their lab values are to hers, and the missing value is then estimated as a weighted average of the known values of the same lab test from the most similar patients.
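The similarity-weighted estimate described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular platform: the function name `impute_lab_value`, the choice of Euclidean distance, the inverse-distance weights, the default of three neighbors, and the lab names and values are all assumptions made for the example. In practice, lab values would normally be standardized first so that tests with large numeric ranges do not dominate the distance.

```python
import math


def impute_lab_value(target, others, lab, k=3):
    """Estimate target[lab] from the k most similar patients (hypothetical helper).

    target: dict mapping lab name -> value, with None for missing values
    others: list of dicts in the same shape
    Returns None when no other patient can contribute an estimate.
    """
    # Labs the target patient actually has, excluding the one being imputed
    known = [name for name, value in target.items()
             if value is not None and name != lab]

    scored = []
    for patient in others:
        if patient.get(lab) is None:
            continue  # this patient has no value to donate for the lab
        shared = [name for name in known if patient.get(name) is not None]
        if not shared:
            continue  # nothing to compare on
        # Root-mean-square difference over the labs both patients have
        distance = math.sqrt(
            sum((target[name] - patient[name]) ** 2 for name in shared) / len(shared)
        )
        scored.append((distance, patient[lab]))

    if not scored:
        return None

    # Keep the k closest patients and weight each by inverse distance
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    weights = [1.0 / (distance + 1e-9) for distance, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Made-up example data: the target patient is missing an HbA1c result
patient = {"glucose": 110.0, "creatinine": 1.0, "hba1c": None}
cohort = [
    {"glucose": 105.0, "creatinine": 0.9, "hba1c": 5.6},
    {"glucose": 180.0, "creatinine": 1.4, "hba1c": 8.1},
    {"glucose": 115.0, "creatinine": 1.1, "hba1c": 5.9},
]
estimate = impute_lab_value(patient, cohort, "hba1c")
```

Because the weights are inverse distances, the estimate always falls between the smallest and largest donor values, and patients who look most like the target dominate the result.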

Additionally, data vendors and analytics providers can assist in quantifying the missingness in data, often through the generation of coverage metrics that indicate the relative sufficiency of a data set for a specific use case. This is critical to building analytics and machine learning models with confidence and critical to good decision-making.
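As a rough illustration of what a coverage metric might look like, the sketch below computes two simple measures over a list of records: the share of records in which each field is populated, and the share of records that have every field a given use case requires. The function names, the definition of “missing” (None or a blank string), and the sample claim records are assumptions for this example; real vendors may define coverage quite differently.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def field_coverage(records, fields):
    """Fraction of records in which each field is populated."""
    if not records:
        return {field: 0.0 for field in fields}
    total = len(records)
    return {
        field: sum(not is_missing(record.get(field)) for record in records) / total
        for field in fields
    }


def use_case_coverage(records, required_fields):
    """Fraction of records that have every field the use case needs."""
    if not records:
        return 0.0
    complete = sum(
        all(not is_missing(record.get(field)) for field in required_fields)
        for record in records
    )
    return complete / len(records)


# Made-up claim records with hypothetical field names and identifiers
claims = [
    {"patient_id": "a", "diagnosis_code": "E11.9", "provider_id": "123"},
    {"patient_id": "b", "diagnosis_code": "", "provider_id": "456"},
    {"patient_id": "c", "diagnosis_code": "I10", "provider_id": None},
    {"patient_id": "d", "diagnosis_code": "E11.9", "provider_id": "789"},
]

per_field = field_coverage(claims, ["patient_id", "diagnosis_code", "provider_id"])
usable = use_case_coverage(claims, ["diagnosis_code", "provider_id"])
```

The two numbers answer different questions: per-field coverage shows where the gaps are, while use-case coverage shows how much of the data set is actually usable for an analysis that needs several fields at once, which can be far lower than any single field suggests.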

The problem is complex, but the solution is not. The first step is recognizing how much data missingness can affect decision-making; the next is addressing it with advanced data science. New platforms correct this pervasive missingness problem so that companies can benefit from their heavy investments in big data and build smarter models on complete, quality data from the start.


Ryan Leurck leads the Analytics and Products teams at Kythera Labs and is a co-founder. He is an engineer and data scientist with over 13 years of experience in operations research, system-of-system design, and research and development portfolio valuation and analysis. Ryan got his start on the research faculty at The Georgia Institute of Technology Aerospace System Design Lab, where he led researchers in the application of machine learning and big data technologies. Ryan holds a Bachelor of Science in aerospace engineering from Auburn University and a Master of Science in aerospace engineering from The Georgia Institute of Technology. He can be reached at ryan@kytheralabs.com.
