Tackling Trial and Data Complexity in DCTs with Advanced Analytics and Machine Learning
Contributed Commentary by Laura Trotta, CluePoints
September 12, 2025 | Advanced statistical and machine learning (ML) models offer new ways to cope with an evolving and increasingly complex clinical trial data landscape. These models enable us to interrogate new, complex, and raw data sources, perform sophisticated data reviews, learn from past studies, and process large volumes of data.
The FDA has said the use of ML could “significantly enhance data integration efforts by using supervised and unsupervised learning to help integrate data submitted in various formats and perform data quality assessments.” By applying advanced statistics and ML, we can make the data review process both easier and more efficient.
Predictive models can be used across both production and research for smarter data review, combining supervised and unsupervised learning. Supervised learning learns from labeled data and past issues, while unsupervised learning surfaces unexpected data patterns, anomalies, and outliers.
In this article, I will focus on a specific use case: decentralized clinical trials (DCTs). DCTs add further complexity in both study design and the types of data collected. However, by relying on advanced analytics, we can increase the efficiency of data review.
Centralized Monitoring Supported by Advanced Analytics
The FDA recommends using risk-based monitoring approaches and the use of centralized monitoring to “identify and proactively follow up on missing data, inconsistent data, data outliers, and potential protocol deviations that may be indicative of systemic or significant errors” in DCTs.
Advanced analytics can support centralized monitoring and risk-based quality management (RBQM). Central statistical monitoring (CSM) allows the analysis of all clinical and operational variables collected during study conduct to detect emerging risks. The principle is simple: each entity (a site, patient, region, or country) is compared to all other entities of the same type, across all variables and all applicable tests, to identify statistical anomalies.
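To make the comparison principle concrete, here is a minimal sketch in Python. It assumes a simple setup in which each site contributes a list of measurements for one variable, and a site is flagged when its mean deviates strongly from the pool of all other sites; the function name, the robust z-score cutoff, and the example data are illustrative assumptions, not CluePoints’ actual methodology.

```python
import statistics

def csm_flag_outliers(values_by_entity, threshold=2.5):
    """Flag entities (e.g., sites) whose mean value for a variable deviates
    from the pool of all other entities of the same type.

    values_by_entity: dict mapping entity id -> list of observed values.
    Returns {entity id: z-score} for flagged entities only.
    """
    means = {e: statistics.mean(v) for e, v in values_by_entity.items()}
    flagged = {}
    for entity, m in means.items():
        # Compare this entity against all *other* entities (the CSM principle)
        others = [x for e, x in means.items() if e != entity]
        mu = statistics.mean(others)
        sd = statistics.stdev(others)
        if sd == 0:
            continue
        z = (m - mu) / sd
        if abs(z) > threshold:
            flagged[entity] = round(z, 2)
    return flagged

# Hypothetical systolic blood pressure readings per site
sites = {
    "site_A": [120, 118, 122, 121],
    "site_B": [119, 121, 120, 118],
    "site_C": [120, 122, 119, 121],
    "site_D": [150, 151, 149, 152],  # anomalous: possible equipment miscalibration
}
print(csm_flag_outliers(sites))  # only site_D is flagged
```

A production system would run many such tests (means, variances, proportions, event rates) over every variable and aggregate the results into a per-entity risk score, but the core idea of comparing one entity to all its peers is the same.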
Anomalies might range from underreporting of adverse events to data propagation and global outliers. These findings can power either key risk indicators (KRIs) or the overall data quality assessment. Issues confirmed by this approach have included misunderstanding of the study protocol, miscalibration of equipment, GCP compliance issues, data tampering, and fraud.
In our own research, we have found more than 8 in 10 sites (83%) using CSM saw their risk score decrease over the course of the study, compared with just over half (56%) of sites not using CSM.
New Sources of Risk and Data Sources in DCTs
DCTs bring extra complexity to data analysis. New independent sources of study activity and data collection, including pharmacies, home healthcare providers, and patients themselves, can impact data reliability and patient safety. DCTs also have additional modalities of data collection, including telemedicine, eCOA, and wearable and connected devices. Measurements are often replicated, which creates longitudinal data.
These digital health technologies are not just capturing clinical data but also audit trail data. They record every entry and every change—from who made it to why a change was made.
This adds an extra layer of complexity, but it is also a useful source of information to look for potential issues and risks during study conduct. For example, audit trail data allows us to identify patterns of system usage that uncover issues with collected data.
Models and Tests for DCT Data
When conducting DCTs, we need to rely on advanced analytics that take into consideration the specificity and complexity of the data. These models and tests need to account for natural variability across patients, sites, and regions, as well as for the replicated nature of longitudinal data.
Models also need a flexible definition of time within the study, expressed in days and weeks rather than site visits. They need to model time dependence, with the best-fitting time dependence selected automatically by the system, and they need to extend to anomaly detection in audit trail data.
Audit Trail Review – Anomalies and Root Causes
If we focus more specifically on audit trail review, there are several use cases where advanced analytical methods have been used to support the review of data.
In the first example, a time similarity test revealed a risk scenario of unusual proximity in ePRO entry times between patients at a site. In the second example, a mean test, a within-patient variability test, and a between-patient variability test revealed that the average duration of patient assessments was unusual and that assessment durations showed unusual variability. In the third example, a proportion test and an event rate test revealed an atypical proportion of data updates.
The possible root causes of these examples varied from critical issues, like the site potentially fabricating ePRO data, to unintentional issues, such as patients being improperly trained on ePRO completion. Advanced analytics can support audit trail review to identify these issues.
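The time-similarity idea in the first example can be sketched as follows. This is an illustrative heuristic under assumed inputs (ISO-format timestamps and a 10-minute threshold I chose for the example), not the actual test used in practice: if the median gap between patients’ ePRO entry times at a site is suspiciously small, the entries may have been made in one sitting rather than independently at home.

```python
import statistics
from datetime import datetime

def epro_entries_suspiciously_close(entry_times, max_median_gap_minutes=10):
    """Return True if the median gap between consecutive ePRO entry
    timestamps at a site falls below a threshold, suggesting entries
    were not made independently by different patients."""
    ts = sorted(datetime.fromisoformat(t) for t in entry_times)
    # Gaps in minutes between consecutive entries
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(ts, ts[1:])]
    return statistics.median(gaps) <= max_median_gap_minutes

# Entries clustered within minutes of each other: suspicious
clustered = ["2025-03-01T09:00", "2025-03-01T09:03",
             "2025-03-01T09:05", "2025-03-01T09:08"]
# Entries spread across the day: expected for independent patients
spread = ["2025-03-01T07:15", "2025-03-01T12:40",
          "2025-03-01T18:05", "2025-03-01T21:30"]
print(epro_entries_suspiciously_close(clustered))  # True
print(epro_entries_suspiciously_close(spread))     # False
```

A real central statistical monitoring system would attach a p-value to such a pattern and compare each site against all other sites, as described above, rather than using a fixed threshold.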
Conclusion
DCTs present an excellent use case for advanced analytics because they offer extra complexity in both the design of the study and the type of data we are collecting. They are also heavily reliant on digital health technologies. DCTs need to rely on RBQM to identify, monitor, and mitigate new risk scenarios. They need CSM to detect sites, regions, and countries at risk but also other dimensions of risk, such as patients and local healthcare providers.
Advanced models can handle the time dependence of the data and highlight anomalies. By advancing our analytics and building models that understand complex data, we streamline and simplify the process of data review. The complexity sits in the model, not the process. Continuing to advance analytics will deepen our insights and allow people to focus on what matters most: improving the lives of people worldwide.
Dr. Laura Trotta joined CluePoints in 2015 and moved into her current role as Vice President of Research in January 2022, where she leads a team of research scientists responsible for developing new statistical and machine learning algorithms to assess the quality of clinical trial data. Laura holds a Master’s degree in Biomedical Engineering and a PhD in Applied Mathematics from the University of Liège, Belgium. She can be reached at laura.trotta@cluepoints.com.