Clinical Data Sharing for AI: Proposed Framework Could Rouse Debate

By Deborah Borfitz

March 24, 2020 | A group of doctors from Stanford University has proposed a framework for sharing clinical data for artificial intelligence (AI) that could set off a firestorm of debate about who truly owns medical data, ethical obligations to share it, and how to properly police researchers who use it. On the other hand, the envisioned approach has parallels to the open science tactics currently being uniformly deployed to battle the COVID-19 pandemic.

The framework’s central premise is that clinical data should be treated as a public good when it is used for secondary purposes such as research or the development of AI algorithms, as detailed in a special report published today in Radiology. That means broadening access to aggregated, de-identified clinical data, forbidding its sale and holding everyone who interacts with it accountable for protecting patient privacy, explains study lead author David B. Larson, M.D., M.B.A., vice chair of clinical operations for the radiology department at Stanford University School of Medicine.

Although the framework was published in a journal specific to radiology, and three of its authors are radiologists, the structure is “universally applicable to other types of medical data as well,” says Larson.

Disputes over clinical data sharing generally involve those who believe patients own the data and those who think institutions do. But Larson and his colleagues advocate a third approach, saying nobody truly owns the data in the traditional sense once it has served its primary, patient care purpose—whether that happens immediately or in another 20 years.

“When data are aggregated and deidentified, and insights get extracted… that is a separate activity,” he says. “That’s not what it was designed to do initially but it is a fortuitous secondary use.”

The doctors further argue that patients, provider organizations, and algorithm developers all have ethical obligations to help ensure that these clinical observations are used to benefit future patients. “We now have to develop some thinking around how we are going to address the appropriateness of that use.”

The current COVID-19 pandemic is a “pretty clear,” if unfortunate, illustration of what a world might look like with an ethical framework in place so patients can trust that their data won’t be used inappropriately and researchers aren’t stymied in their efforts to use it for clinically beneficial purposes, says Larson. While the framework was written in the spirit of open science, he adds, it also recognizes that algorithms derived from clinical data may have intellectual property associated with them, which should not preclude other people from having access to the same raw materials.

Irreconcilable Differences

Larson and his Stanford colleagues felt pressed to develop an ethical foundation for clinical data sharing in the absence of any formal guidance for holding themselves accountable, Larson says. “Once we did that it seemed reasonable to share it with others.”

The framework is designed to overcome the irreconcilable positions of those who say either patients or providers “own” the data—meaning, who has permission to access it and rights to profit from it—which is “preventing us from moving forward in a reasonable way,” says Larson. The views of neither camp can be fully justified from an ethical standpoint. “We subscribe to an ethics framework (doi: 10.1002/hast.134) developed by Ruth Faden and others that we all should be contributing to the common good.”

Since people have been benefitting from the research and improvement efforts of health systems for centuries, he reasons, they should not be able to withhold the use of data in ways that benefit others in the future. “We think we’re providing an avenue that addresses the major concerns on either side.”

The framework will hopefully lay the groundwork for future, national-level changes to the way both data and organizations are structured, in and outside the U.S., says Larson. It was only relatively recently that discrete data in electronic health records were even available for use, and the tools to process and learn from that data have been on the scene for even less time.

“Patients generally don’t withhold data from their care provider when it is being used on their behalf … [because] a relationship of trust has been established,” Larson says. Yet when people think clinical data is being used by AI developers and other non-provider entities, they’re immediately suspicious.

“We’re pushing back on that and saying if those entities can’t currently be trusted then let’s create the ground rules so they can be… and if they’re not willing to participate in that environment then they shouldn’t have access to the data. Let’s increase the inherent trust in the system by holding those who have access to the data accountable to be good stewards,” as providers are already doing.

“We can hold ourselves accountable and the broader community more accountable for being good data stewards, which hasn’t really happened up until now and we think it should,” says Larson. As envisioned, the entity releasing the data would be responsible for ensuring that it is going to a trusted partner and being used as contractually specified.

It’s “reasonable” for providers who maintain and process clinical data to charge outside entities an access fee, but it should not be excessive, Larson continues. But under no circumstances should they strike an exclusive agreement with one entity that precludes the same access by others.

Regulations will also need to be written to further ensure entities use the data appropriately and for beneficial purposes, Larson says. This would include penalties for using data they receive that is accidentally identified or using technology that allows them to identify individuals from the data.

As Larson points out, the framework refers to “wide” rather than public release of data in the belief that data should be used only by those who identify themselves and agree to be held accountable for its appropriate use.

Early concepts that contributed to the proposed ethical framework were presented at BOLD AIR (Bioethics, Law, and Data-sharing: AI in Radiology Summit), organized by the departments of radiology at Stanford and New York University Langone Medical Center last April, says Larson. The one-day event was co-sponsored by the American College of Radiology, Radiological Society of North America, Massachusetts General Hospital, Stanford Center for Artificial Intelligence in Medicine and Imaging, and the Center for Advanced Imaging Innovation and Research.

A follow-up meeting to further discuss the salient issues is now on hold due to COVID-19. If nothing else, Larson says, the proposed framework is in the literature to fuel thoughtful discussion and hopefully inform future regulation.

Outstanding Questions

The strongly worded announcement of intention came as welcome news to Megan Doerr, a genetic counselor and principal scientist with the open-science organization Sage Bionetworks. “The more people able to use scientific data for scientific [problem] solving, the more scientific solutions we’re going to have to our problems,” she says.

“The devil is really in the details,” Doerr continues. While she applauds the idea of extending responsibility for the data to anyone who uses it, for example, it is uncertain how that might be practically accomplished.

Institutions have traditionally acted as proxy bonding agents for researchers, so there is “someone to fine or sanction” if there are ethical violations, Doerr says. Who would serve as the ethics watchdog once large data sets are opened to a larger, more diverse community of users? Maybe Lloyd’s of London, which bonds astronauts on space shuttle missions?

Another concern is how to appropriately protect the privacy and rights of people whose data are being shared, she says. “The more data that is available and can be cross-referenced, the quicker we realize ‘de-identified’ data is not [really] de-identified… We can’t promise anybody that their privacy is going to be protected and as scientists we need to be honest about this, and we are not. We contort in a million different ways to avoid this uncomfortable truth.”

What’s needed are legal protections so if the information is used in ways that are inconsistent with the agreed data use, money would flow to people who were impacted to mitigate the harm, says Doerr.

The proposed framework may also raise social justice and equity questions, she adds, since “brilliant but resource-limited scientists” may not be able to afford the entry cost of problem-solving.

Cost Concerns

The paper talks about a lot of important issues, offers a sound ethical framing for clinical data sharing and is well cited, says Doerr. But she views the proposed framework more as an “opening salvo” due to what it does not address—who might pay for the cost of compute and metadata harmonization, which are the two biggest barriers to more effective AI research.

Radiological datasets are massive, measured in petabytes, and therefore require a tremendous amount of computing power to host, she notes. “Nobody can download the data; it takes forever. So it’s not like researchers are going to be downloading a local copy of the data to work with… and to create a hosting space for these data is a very expensive thing to do, which is one of the challenges of the All of Us Research Program [of the National Institutes of Health].”

Doerr chairs the researcher application subcommittee for the All of Us Research Program and sits on the resource access board for the All of Us dataset. David Magnus, Ph.D., a professor of medicine, biomedical ethics and pediatrics at Stanford, and one of the authors of the proposed clinical data framework, is vice-chair of the institutional review board for the program.

Doerr wonders: Will researchers have to pay into a system giving them access to a sandbox area where the clinical data are stored? If so, which institutions would be doing the primary data gathering?

Authors of the Radiology report make the point that aggregating clinical data from multiple institutions “may markedly enhance the value of the data,” Doerr says, “which is absolutely true, but… incredibly expensive and difficult. So, who is going to do that work and who is going to pay for it?”

It would be a “tremendous waste” to have individual AI developers be responsible for hosting their own data and doing their own metadata harmonization, says Doerr. More importantly, it could lead to inconsistent results. The datasets they’d be working with would be “effectively tuned into different keys and may not return compatible insights.”

This is a problem Sage Bionetworks has been toiling over for quite a while, Doerr says—as has study co-author Nigam H. Shah, MBBS, Ph.D., associate professor of medicine (biomedical informatics) at Stanford and assistant director of the Center for Biomedical Informatics Research.

The paper alludes to data stewardship and talks about federated learning, a system Sage Bionetworks used for its Digital Mammography DREAM Challenge that demonstrates both the efficacy and challenges of the approach, she says.

“We had hundreds of thousands of digital mammography images and recognized that they couldn’t be de-identified, so we had solvers send their machine learning models to us and we then ran them against the data on their behalf and returned the results to researchers. In this way, they never saw the actual data, but they could still fit their models to it.”
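The arrangement Doerr describes, in which models travel to the data rather than the other way around, could be sketched roughly as follows. This is an illustrative toy and not Sage Bionetworks’ actual pipeline: `Record`, `DataHost`, and `threshold_model` are hypothetical names, and the four records are made-up stand-ins for images that never leave the host.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Record:
    """One private record: numeric features plus a ground-truth label."""
    features: Tuple[float, ...]
    label: int


class DataHost:
    """Hypothetical data steward: keeps records private, returns only scores."""

    def __init__(self, records: Sequence[Record]) -> None:
        self._records = list(records)  # never exposed to the solver

    def evaluate(self, model: Callable[[Sequence[float]], int]) -> Dict[str, float]:
        # Run the submitted model against the private data on the solver's behalf.
        correct = sum(1 for r in self._records if model(r.features) == r.label)
        # Only aggregate results are sent back, not the records themselves.
        return {"n": len(self._records), "accuracy": correct / len(self._records)}


def threshold_model(features: Sequence[float]) -> int:
    """Toy solver-submitted model: predicts 1 when the mean feature exceeds 0.5."""
    return 1 if sum(features) / len(features) > 0.5 else 0


# Made-up records standing in for the host's private dataset.
host = DataHost([
    Record((0.9, 0.8), 1),
    Record((0.1, 0.2), 0),
    Record((0.7, 0.6), 1),
    Record((0.4, 0.3), 1),
])

result = host.evaluate(threshold_model)
print(result)
```

The solver only ever receives the summary dictionary, which is the property Doerr highlights: researchers can test and refine models without seeing the underlying data.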

The Challenge was a costly undertaking supported by grants, Doerr says. Scientists at Sage spent an exhaustive number of hours manually harmonizing the datasets so that the data returned authentic, trustworthy results.

Very few people are good at metadata harmonization, she adds, because the cost of compute is so expensive it limits who gets to do AI research. “Honestly, that might be an OK thing right now. I don’t think communities have any idea about the individual and group harms that could be caused by bad AI research… [that] within medicine could really be a problem.”

In discussing their proposed framework, the authors say initiatives such as the All of Us program could serve as an example of “how to allow participation from any qualifying research and development organization following an established vetting process.” The reality is that the researcher application subcommittee is a 25-person team effort with an expected development timeline of two to three years, Doerr says.

On the cost of compute question, Larson says the cost should be borne by the party that accrues the benefit. “If it turns out that the value is mainly to the public, then maybe this should be financed like other research through public and private entities.” Alternatively, commercial entities might pick up the tab if they are profiting from the intellectual property that they derive from the data. “I think there are a number of potential finance models that do not require selling of the data.”

Figuring out who will do the data harmonization work will be an iterative process, he adds. “I think it would be unwise and almost certainly untenable to try to impose a single standard right now because I think there will probably be other purposes that will drive standardization over time more quickly, such as allowing patients to move from one healthcare system to another. There will likely also be other processes to help reconcile multiple standards.”

Pushback Expected

Some medical institutions might try to claim indefinite clinical data ownership by arguing that the information serves important care purposes for patients’ lifetime—or longer, given the implications for family members, says Doerr. But from her perspective, everyone owns a copy of the data; they just need it for different purposes.

“Data are not a traditional commodity,” she says. “I can have a copy, the hospital can have a copy and there can be a copy out in the public domain and all of it still retains its usefulness.” Ultimately, having more users only magnifies the value of the data.

“The whole concept of ownership is somewhat flawed,” Doerr says. “As an individual, I have a right to the data that is generated about my body from my body, and in consenting for medical care I give my providers the right to that data, too. I am paying them for the service of interpretation of those data.” The pool of data that gets generated by this service over time “should flow into the public domain and be used to the benefit of the public good.”

Doerr says she feels strongly that everyone has an ethical obligation to ensure that happens, a concept that is more intuitive in nations with a single payer system such as Canada and the United Kingdom. “Our system because of its byzantine structure makes it a little less obvious, but our ethical obligation remains constant.”

This altruistic contribution of data for societal benefit is precisely what is happening in the wake of the coronavirus, “because we have an emergency… [and] we know data sharing can help,” Doerr continues. “One could argue that breast cancer that kills way more people every year might be a similar emergency and there are many other conditions that are equally deadly if not more so.”

If an open science approach can be embraced on a worldwide scale for the COVID-19 outbreak, including a worldwide commitment to make research and data freely available, “why can’t we do it for anything else?” she asks.

Notification to use clinical data should be required, and not just on an individual level, since privacy protection cannot be guaranteed, she contends. “If we allow folks to opt out of the system the data will become even more biased than it already is, and artificial intelligence and machine learning techniques only serve to amplify that bias.”

What’s needed is a “collective conversation” resulting in society-wide consent to this new paradigm acknowledging the value and accepting the risk related to the secondary use of clinical data, she continues. Large-scale genomic data sharing has familial implications that no amount of goodwill can wholly safeguard via the standard informed consent process, especially when identical twins don’t share the same sentiments about data sharing.

“That’s why lawmaking is going to need to be the central component of this,” Doerr says. “Laws are one of the ways we exert our collective will toward a given end.”

While Doerr enthusiastically supports the level of clinical data sharing envisioned by Stanford radiologists, she notes that a multitude of financial incentives are working against their ideas—notably, the multi-trillion-dollar data brokerage business. Some pushback can be expected from large medical institutions in the U.S., including the not-for-profit Mayo Clinic and Cleveland Clinic that generate a sizable amount of revenue selling data to Google. As was widely reported last November, and referenced in the Radiology report, Google also made a widely debated deal with the 150-hospital Ascension health system to further its AI agenda.