07 Dec 2022

NeurIPS 2022

Authors: Anna Munoz-Farre, Harry Rose, Sera Aylin Cakiroglu

Abstract

Electronic health records (EHR) offer the opportunity for richer phenotype definition and more accurate risk prediction than bespoke cohorts collected for specific research purposes. Combining multiple structured data sources, such as primary and secondary care records, is crucial to understanding patient trajectories and disease severity. However, a key challenge lies in combining multiple ontologies. Current approaches rely on manually curated mappings between ontologies and are often prone to error.

In this paper, we unify ontologies using textual descriptors of concepts such as diagnostic codes. We fine-tune pretrained language models to denoise and identify mis- or undiagnosed individuals based on their medical history. We validate our approach using the UK Biobank, a large-scale biomedical database. We demonstrate that our method yields calibrated disease predictions for undiagnosed patients compared to non-text and single-ontology approaches. Finally, we demonstrate empirically how our method can be used for cohort expansion, with an in-depth clinical evaluation for sex-specific diseases and for a Type II Diabetes Mellitus use case.
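
As a rough illustration of unifying ontologies through textual descriptors, the sketch below replaces each code in a patient history with its description, so that records from different sources end up in one shared text space that a language model could tokenize. This is not the authors' implementation: the descriptor tables, the code identifiers (`pc_001`, `h_001`, and so on), the `to_text` helper, and the `[SEP]` separator are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Toy descriptor tables. The codes are hypothetical placeholders,
# not extracts from any real clinical ontology.
PRIMARY_CARE: Dict[str, str] = {
    "pc_001": "type 2 diabetes mellitus",
    "pc_002": "essential hypertension",
}
HOSPITAL: Dict[str, str] = {
    "h_001": "type 2 diabetes mellitus",
    "h_002": "chronic kidney disease",
}
DESCRIPTORS: Dict[str, Dict[str, str]] = {
    "primary_care": PRIMARY_CARE,
    "hospital": HOSPITAL,
}


def to_text(history: List[Tuple[str, str]], sep: str = " [SEP] ") -> str:
    """Replace each (ontology, code) pair with its descriptor and join them.

    The separator is an assumption; a real pipeline would use whatever
    special tokens its chosen language model expects.
    """
    return sep.join(DESCRIPTORS[ontology][code] for ontology, code in history)


# A patient history mixing both sources, in chronological order.
history = [
    ("primary_care", "pc_002"),
    ("hospital", "h_001"),
    ("hospital", "h_002"),
]
patient_text = to_text(history)
print(patient_text)
```

In this toy setup, two codes from different ontologies that share a descriptor map to identical text, so no hand-curated code-to-code mapping is needed to treat them as the same concept.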

