Can We Take A Minute For AI’s Less Sexy Sidekick: Data Quality.
A young, healthy man receives a letter from his GP. It's the middle of a pandemic, and he's been asked to shield because he is "at risk". He has no idea why.
A team developing an algorithm to predict disease cases finds it works really well when tested. However, the model turns out to be pretty poor at predicting disease cases on new data.
The American Fertility Society recommends a treatment for postmenopausal women based on analysis of a routine EHR dataset. It turns out the treatment did more harm than good.
The above are examples of what can go wrong when data quality falls by the wayside. The consequences can range from a confusing letter to real patient harm.
In the first case, the man's height had been mistyped in a health record system as 6 inches rather than 6 feet, leading to an incorrect BMI estimation that triggered an addition to the shielding list.
In the second case, the team was unaware that their training and testing data contained duplicates (they weren't obvious). Many of the cases of the disease the model had been trained on were also those it was being tested on, so it looked very good. However, the model did poorly when tested on an external dataset from another organisation.
In the third case, the AFS recommendation was based on real-world EHR data, which had limitations and biases. Those flaws made the difference between concluding that the new therapy was good for women and concluding that it was harmful overall.
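For readers who like to see the mechanics, here is a minimal sketch of the kind of check that could have caught the second problem. It is an illustration only: the field names, the records and the helper functions (normalise, find_leaked_records) are made up for this example, and real duplicate detection in health data usually needs fuzzier matching than this.

```python
def normalise(record):
    """Reduce a record to a comparable key: trimmed, lowercased field values."""
    return tuple(str(record[field]).strip().lower() for field in sorted(record))


def find_leaked_records(train, test):
    """Return the test records whose normalised form also appears in training."""
    seen_in_training = {normalise(r) for r in train}
    return [r for r in test if normalise(r) in seen_in_training]


# Hypothetical records, invented for illustration.
train = [
    {"age": 54, "sex": "F", "symptom": "Cough"},
    {"age": 31, "sex": "M", "symptom": "fever"},
]
test = [
    {"age": 54, "sex": "f", "symptom": "cough "},  # same case, typed differently
    {"age": 77, "sex": "M", "symptom": "fatigue"},
]

leaked = find_leaked_records(train, test)
print(f"{len(leaked)} of {len(test)} test records also appear in training data")
```

Any test record that shows up here is one the model has effectively already seen, so a score that includes it will flatter the model.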
The anthem of our time could be 'anything is possible with AI'. When it comes to health, the sentiment is no different. However, in the excitement of what AI can do, we give far less airtime to its trusty sidekick and foundational element of all data analysis: data quality.
Can you blame us, though? The possibilities of modern AI, from "rapidly sequencing the genome, predicting hospital admissions and even helping explain health conditions to patients," are undeniably compelling.
Data quality, on the other hand, is not new and is, by most people's standards, pretty boring. Very quickly, you get into a world of 'minimum standards protocols', 'data collection pipelines', and 'staging databases', terms most of us would rather avoid.
That is, until you see what can go wrong, from confusing patient comms to patients being prescribed harmful treatment. Data quality might be dull, but it can be the difference between effective and safe AI/data tools and ones that are crappy or downright dangerous.
Here’s a jargon-free primer on data quality, what it is and why you should care about it.
Five things you should know
Data quality is not a black and white concept
Poor data quality, broadly speaking, means a dataset has issues that make it unfit for use in the way we want to use it. These issues can include data fields that do not accurately represent reality (is the man really 6 inches tall?), missing data fields or duplicates, and information that is inconsistent across the different sources the data is drawn from.
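To make those categories concrete, here is a rough sketch of how three of them (missing values, duplicates and implausible values) might be counted in a small table of records. Everything in it is an assumption made for illustration: the field names, the sample records, the quality_report helper and especially the plausible ranges, which a real project would agree with clinicians.

```python
def quality_report(records, key_field, plausible_ranges):
    """Count records with missing values, repeated keys and implausible values."""
    report = {"missing": 0, "duplicates": 0, "implausible": 0}
    seen_keys = set()
    for record in records:
        # Missing: any field left empty.
        if any(value is None for value in record.values()):
            report["missing"] += 1
        # Duplicates: the same identifier appearing more than once.
        key = record.get(key_field)
        if key in seen_keys:
            report["duplicates"] += 1
        seen_keys.add(key)
        # Implausible: a recorded value outside the range we consider realistic.
        for field, (low, high) in plausible_ranges.items():
            value = record.get(field)
            if value is not None and not low <= value <= high:
                report["implausible"] += 1
    return report


# Hypothetical records and thresholds, invented for illustration.
records = [
    {"patient_id": 1, "height_cm": 183.0, "weight_kg": 80.0},
    {"patient_id": 2, "height_cm": 15.2, "weight_kg": 80.0},   # 6 inches, not 6 feet
    {"patient_id": 3, "height_cm": 170.0, "weight_kg": None},  # missing weight
    {"patient_id": 1, "height_cm": 183.0, "weight_kg": 80.0},  # repeated record
]
ranges = {"height_cm": (50.0, 250.0), "weight_kg": (2.0, 400.0)}

report = quality_report(records, "patient_id", ranges)
print(report)
```

A check like this only flags candidates for review. Deciding whether a flagged record is actually wrong, and what to do about it, still takes someone who knows where the data came from.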
Data quality concerns have existed since well before modern AI came into the picture. Outside of health, data quality issues have been responsible for space exploration failures, miscalculation of credit scores and accidental share payouts.
It's important to understand that there is no black-and-white definition of "good quality data." A perfect dataset without any issues is not realistic and is not the goal. Data is a product of messy real-world processes, imperfect IT systems, and human input. Waiting for an error-free dataset, especially in the realm of real-world health data, is a futile exercise.
Understanding what "good quality" or "good enough" means and how much we care about certain quality elements will partly depend on the question we are trying to answer, our intended use, and the source of the data. For example, we want a higher data quality threshold when training AI clinical decision-support tools that will advise clinicians.
Data quality matters a lot when it comes to health (especially AI and health)
Data quality in health needs special consideration. The impact of data quality on health, particularly in the context of AI and Health, cannot be overstated.
There’s now a growing push to train AI models on the varied types of available datasets: for example, ‘routine’ health data, such as health record data, clinician notes, and other operational data, or ‘out-of-hospital’ data on social care or wider determinants of health.
These are undoubtedly rich datasets and could provide a lot of valuable insights. Many of these also have significant quality issues because we haven’t invested well in data strategy in these areas.
Historically, much of this health data has been collected primarily for immediate patient management, operations, or billing rather than for developing complex data models. In the UK, datasets can include incorrect diagnosis codes, missing fields for some long-term conditions, and sparse data records in some communities.
We are still learning about the data quality issues that exist in many aspects of the health data ecosystem. This is not easy, as data is held across multiple organisations, some outside the NHS and with varying capabilities in data strategy.
These rich sources of data have the potential to provide important insights into patient care, treatment, disease progression and more.
Yet without a clear picture of data quality issues, we risk developing models and tools that lead to inaccurate diagnoses, ineffective treatments, and compromised patient safety.
Poor data quality can go hand in hand with bias and unequal health systems
One of the most entrenched issues in health data is the absence of data for certain minority groups. There are many reasons for this, including interactions of these groups with the health system, the way health workers record information and more systemic issues.
The implications of this for AI models and fairness in health systems can be dire. For example, a US-based AI model unfairly recommended lower levels of health support to certain minority groups based on poor-quality training data. This is a big problem for the UK, too, where the pandemic highlighted the extent of health inequalities by ethnicity.
Advanced data models and tools don't mean we no longer need to care about data quality
It's a common misconception that new technologies and large datasets mean we don't have to consider data quality. A large dataset with quality issues is still subject to the "garbage in, garbage out" principle. Computing power, advanced algorithms, and large datasets won't magically solve the problems.
Researchers have found that data quality matters significantly even when it comes to the performance of advanced AI models, including those able to process large unstructured datasets. Data quality is also a contributing factor to LLM hallucinations.
Notably, influential figures in the field, like Andrew Ng (co-founder of Google Brain and creator of THE Coursera machine learning course), are now championing a 'data-centric' approach to AI, prioritising improving the data over the models. The importance of data quality is also evident in the significant amount of time and resources that OpenAI teams invest in curating and cleaning their training data.
Applying the 'data-centric' approach to health, it becomes clear that we need to devote resources to data quality. This is not just a concern for AI and machine learning, but a universal principle that shapes the reliability and effectiveness of any data-driven system.
Datasets won't be perfect, but there's still action we can take
No dataset is going to be totally immune from quality issues. Data is generated from messy real-world processes, imperfect IT systems and software, and humans. We want our healthcare workers to focus on patient care, not have a situation where the patient deteriorates but 'we get good data'.
However, when we seek to develop and deploy advanced data models that could shape patient care, our data quality standard in health has to be higher. There are clear actions we can take on this.
Open communication is essential, too. When it comes to working with messy, real-world data, it’s crucial to have a transparent discussion about the level of quality we are comfortable with across various applications of these tools. Perhaps the most challenging issue here is ensuring all those using AI tools clearly understand the risks and limitations these tools carry due to varying levels of data quality.
At first glance, data quality can seem like a dry subject full of jargon. However, if you look closer, it goes well beyond the realm of data engineers and analysts.
Data quality means a conversation about the risks we collectively want to take in health decision-making, the inequality we need to address in delivering care and the universal lesson that sometimes the boring stuff turns out to be some of the most important.
In a future article, I will cover the practical ways in which organisations are addressing some of the data quality challenges we face with health data.