When healthcare data contains biases or suffers from missingness, the resulting analyses, predictions, and interventions may inadvertently perpetuate or even amplify existing inequalities. In the third instalment of our data assets series, we explore the nature of these data challenges and offer practical strategies for mitigation.
Understanding Bias in Healthcare Data
Bias in healthcare data generally manifests as systematic errors leading to unrepresentative or inaccurate outcomes. In practice, this often appears as over- or under-representation of certain populations based on factors such as gender, ethnicity, or socioeconomic status, which can significantly reduce the generalisability of findings.
Several types of bias commonly affect healthcare datasets:
- Selection bias occurs when study populations fail to represent broader patient demographics. Clinical trials frequently exhibit age, sex, or ethnicity-related biases, resulting in treatments that may work less effectively for under-represented groups.
- Measurement bias stems from inconsistent data collection methods, such as variations in measuring equipment or calibration differences between data sources.
- Reporting bias represents a major challenge in scientific literature, where positive findings are more likely to be published, leading to an overrepresentation of favourable results in the evidence base.
- Algorithmic bias can emerge during model training on already biased datasets, leading to unfair outcomes. For instance, AI-trained classifiers for medical imaging might perform better for certain demographics than others due to training data that contained population biases.
These biases don’t exist in isolation – they reflect and reinforce existing social, economic, and systemic inequalities, directly influencing clinical decision-making, patient trust, and health equity.
The Challenge of Missing Data
Missingness in healthcare data arises from various sources and presents unique analytical challenges:
- Incomplete patient records often result from time constraints, oversight, or lack of standardised protocols across healthcare providers. Potentially informative factors, such as housing or socioeconomic status, may go unrecorded, leading to inaccurate risk assessments or suboptimal care.
- Variability in reporting systems contributes to data gaps. For example, disease codes in ICD-9 and ICD-10 aren't directly comparable, and conversion between the two systems may result in information loss.
- Data entry errors are inevitable in human-managed systems, resulting in omissions or incorrect values being recorded.
The level of bias introduced can vary with the type of missingness, so understanding which type is present is crucial for appropriate analysis:
- Missing Completely at Random (MCAR) data has minimal impact on results as it introduces no systematic bias. For example, in a long-term health study, patients have their blood pressure measured annually; due to a technical issue with the monitors, some readings are randomly lost across different age groups, genders, and health conditions. Here, missing values bear no relationship to observed or unobserved variables, and the probability of being missing is the same for all observations.
- Missing at Random (MAR) data occurs where observed variables are systematically related to missingness. In such cases, the probability of being missing is the same only within groups defined by the observed data (e.g. sex). For example, in the same longitudinal health study, younger and healthier individuals are less likely to attend follow-up appointments. Here, the missingness depends on age and perceived health status (both recorded variables in the study), and does not directly depend on the participants’ blood pressure levels. Adjusting for these known factors can mitigate this type of missingness.
- Missing Not at Random (MNAR) presents the greatest challenge as missingness directly relates to the missing variable itself. Using the same health study as an example, if participants with high blood pressure were avoiding clinic attendance because they feared bad results or potential treatment recommendations, missingness would be directly related to the unobserved value itself (i.e. blood pressure). This pattern can lead to severe bias and inaccurate conclusions.
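The three mechanisms above can be illustrated with a small simulation. The sketch below (Python with NumPy; the cohort size, blood pressure distribution, and missingness rates are invented for illustration) shows that under MCAR the mean of the observed readings stays close to the true mean, while under MNAR, where high-pressure patients avoid the clinic, the observed mean is biased downwards:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Simulated systolic blood pressure readings (mmHg)
bp = rng.normal(130, 15, n)

# MCAR: ~30% of readings lost at random (e.g. a faulty monitor),
# independently of patient characteristics and of the reading itself
mcar_mask = rng.random(n) < 0.3

# MNAR: patients with high blood pressure (> 140) are far more likely
# to skip their appointment, so missingness depends on the unobserved value
mnar_mask = rng.random(n) < np.where(bp > 140, 0.6, 0.1)

true_mean = bp.mean()
mcar_mean = bp[~mcar_mask].mean()  # stays close to the true mean
mnar_mean = bp[~mnar_mask].mean()  # biased downwards: high readings are missing

print(f"true: {true_mean:.1f}, MCAR: {mcar_mean:.1f}, MNAR: {mnar_mean:.1f}")
```

Complete-case analysis of the MNAR data would understate the prevalence of hypertension in this cohort, which is exactly the kind of inaccurate conclusion described below.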
The consequences of missing data extend beyond statistical considerations and include:
- Inaccurate conclusions that underestimate disease prevalence and misinform policy decisions, particularly for under-represented population subgroups
- Biased algorithms resulting in inaccurate risk assessments and poorer model performance
- Increased risk of medical errors, delayed diagnoses, and incomplete or ineffective treatment plans
- Disproportionate effects on already vulnerable communities, exacerbating healthcare inequities
Bias Mitigation and Handling Missing Data in AI Models
Mitigating bias in AI models starts with using diverse and representative datasets that capture real-world variability across age, ethnicity, gender, and socioeconomic status. A well-rounded dataset enhances model generalisability and reduces the risk of biased outcomes. Regular auditing of model outputs helps identify and correct biased patterns, often through techniques like weighting. Fairness-aware machine learning methods can further address this by employing algorithms that balance performance across demographic groups while maintaining accuracy.
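One common correction technique mentioned above is weighting. The sketch below (Python; the demographic groups and their counts are hypothetical) assigns inverse-frequency sample weights so that each group contributes equally to a model's training loss, rather than the majority group dominating:

```python
import numpy as np

# Hypothetical demographic attribute in a training set:
# group A is heavily over-represented relative to B and C
groups = np.array(["A"] * 800 + ["B"] * 150 + ["C"] * 50)

# Inverse-frequency weights: rarer groups receive larger per-sample
# weights, so every group's total contribution to the loss is equal
labels, counts = np.unique(groups, return_counts=True)
weight_for = {g: len(groups) / (len(labels) * c) for g, c in zip(labels, counts)}
sample_weights = np.array([weight_for[g] for g in groups])

# Every group now contributes the same total weight (n / n_groups)
for g in labels:
    print(g, sample_weights[groups == g].sum())
```

Most modern training APIs accept per-sample weights of this form (e.g. a `sample_weight` argument), making this a low-effort first step after an audit reveals imbalanced model performance across groups.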
Handling missing data is equally crucial for building robust AI models. Methods such as statistical imputation (e.g. replacing missing values with the median, or with regression-based predictions) and AI-driven data augmentation can generate realistic replacement values based on learned patterns. Synthetic data generation offers an alternative solution, simulating realistic data while preserving privacy through controlled noise. Additionally, data standardisation across healthcare systems, using frameworks like Fast Healthcare Interoperability Resources (FHIR) and terminologies such as ICD-10, enhances data uniformity and interoperability, ensuring completeness and consistency across different sources.
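The two simpler imputation strategies mentioned above can be sketched as follows (Python with NumPy; the age and blood pressure values are simulated, and the linear relationship between them is an assumption made for illustration). Median imputation fills every gap with one value, while regression-based imputation exploits a fully observed covariate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical cohort: age is fully observed, blood pressure has gaps
age = rng.uniform(30, 80, n)
bp = 90 + 0.7 * age + rng.normal(0, 5, n)   # BP rises with age (simulated)
bp[rng.random(n) < 0.2] = np.nan            # ~20% of readings missing

observed = ~np.isnan(bp)

# Median imputation: simple, but ignores the age-BP relationship,
# shrinking the imputed values towards the centre of the distribution
bp_median = np.where(observed, bp, np.nanmedian(bp))

# Regression-based imputation: fit BP ~ age on the observed cases,
# then predict the missing readings from each patient's age
slope, intercept = np.polyfit(age[observed], bp[observed], 1)
bp_regress = np.where(observed, bp, intercept + slope * age)
```

In practice, library implementations (such as scikit-learn's imputers) handle multiple covariates and repeated refinement, but the principle is the same as this two-line version.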
The Importance of Multidisciplinary Collaboration
Addressing data challenges requires perspectives from multiple disciplines:
- Healthcare professionals provide the domain knowledge needed to ensure data remains clinically relevant
- Data scientists apply technical expertise to address missing data and biases
- Epidemiologists identify factors related to subject recruitment that may skew data
- Ethicists ensure adherence to ethical standards, with a focus on privacy and fairness
The COVID-19 pandemic highlighted the importance of interdisciplinary collaboration. Data scientists, healthcare professionals, and epidemiologists combined multiple healthcare datasets (e.g. hospital admissions, test data, patient demographics) for real-time tracking and decision-making. This expert input ensured comprehensive and accurate data, improving pandemic responses and informing critical resource allocation decisions.
Regulatory and Ethical Considerations
Equitable data access helps prevent bias in healthcare models. Ethical frameworks highlight the need for datasets representing all demographics to avoid widening health disparities in predictive models or clinical decision-making.
Informed consent ensures patients understand how their data will be used, including clear explanations of purpose, risks, and potential benefits, while ensuring their participation remains voluntary.
How bioXcelerate AI Improves Data Fairness and Completeness
Large genetic datasets from population studies often suffer from missing data, making cross-dataset completeness essential for meaningful analysis. Using bioXcelerate AI's ImpMap algorithm, we address missingness across vast genetic datasets, enabling the seamless integration of thousands of datasets to uncover shared genetic contributors to disease. Further, while current state-of-the-art genetic colocalisation algorithms are limited to single-ancestry study populations, PleioGraph can identify shared genetic causal mechanisms across datasets spanning different ancestry groups. This ensures the inclusion of often-overlooked datasets, enhancing the generalisability of results across populations.
Ensuring fairness and completeness in healthcare data requires diverse, representative datasets and robust imputation methods to fill data gaps while minimising bias. Transparent data access policies further promote equity, while AI-driven augmentation and synthetic data generation can enhance data quality without compromising privacy.
Multidisciplinary collaboration between healthcare professionals, data scientists, and ethicists is crucial for identifying data gaps and ensuring AI models remain both accurate and fair. Best practices focus on standardising data formats, improving cross-system interoperability, and implementing privacy-preserving techniques to protect patient data while unlocking valuable insights.
To learn how bioXcelerate AI can help your organisation improve data integrity and fairness, contact us today.