Mining Electronic Health Records for Real-World Evidence

The rapid accumulation of large-scale Electronic Health Records (EHRs) presents considerable opportunities to generate real-world evidence that can inform clinical decision-making and accelerate drug development. However, the complexity of EHRs has made them a formidable testing ground for cutting-edge AI algorithms, and a significant gap still exists between algorithm development in the computer science community and clinical translation in the healthcare community. This tutorial aims to bridge this divide and foster mutual understanding between the two communities by discussing advanced machine learning and data mining technologies tailored to real-world healthcare challenges, including 1) using EHRs and trial emulation to understand Long COVID and to repurpose drugs for Alzheimer’s disease, and 2) risk prediction and its associated issues, such as fairness, interpretability, and generalizability. We will conclude by discussing opportunities for future research and the prospects of a career as a health data scientist.

Tutorial information

Real-world data (RWD) usually refers to patient data collected during the delivery of health care. Common sources of RWD include electronic health records (EHRs) and administrative claims. EHRs, for example, contain data ranging from structured domains (e.g., diagnoses, prescriptions, procedures, laboratory tests, and vital signs) to unstructured domains (e.g., clinical notes and medical images). Real-World Evidence (RWE) is defined by the FDA as "clinical evidence about the usage and potential benefits or risks of a medical product derived from analysis of RWD"; more broadly, the term can cover clinical evidence generated from observational, noninterventional studies. In this tutorial, we aim to introduce how to use EHRs and machine learning methods to solve real-world healthcare challenges. We will introduce machine-learning-driven trial emulation methods and show how to use them to improve our understanding of, and our ability to predict, treat, and prevent, the post-acute sequelae of SARS-CoV-2 (Long COVID), as well as to perform comparative effectiveness analysis and drug repurposing for Alzheimer's disease.
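To make the idea of emulating a trial from observational data more concrete, the following is a minimal sketch of inverse probability of treatment weighting (IPTW), one common way to adjust for confounding in such analyses. It is an illustration only and not necessarily the method presented in the tutorial: the function name `iptw_risk_difference`, the record layout, and the toy data are all assumptions made for this example, and the propensity score is estimated by simple stratum frequencies rather than by a machine learning model.

```python
from collections import defaultdict

def iptw_risk_difference(records):
    """Estimate a treated-vs-untreated risk difference with IPTW.

    records: iterable of (stratum, treated, outcome) tuples, where
    treated and outcome are 0 or 1 and stratum is a single discrete
    confounder (hypothetical layout chosen for this sketch).
    Assumes every stratum contains both treated and untreated patients.
    """
    records = list(records)

    # Propensity score P(treated = 1 | stratum), estimated by frequency.
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n, n_treated]
    for stratum, treated, _ in records:
        counts[stratum][0] += 1
        counts[stratum][1] += treated
    propensity = {s: n_treated / n for s, (n, n_treated) in counts.items()}

    # Weighted outcome means per arm (normalized, Hajek-style weights).
    weighted_events = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for stratum, treated, outcome in records:
        p = propensity[stratum]
        weight = 1.0 / p if treated else 1.0 / (1.0 - p)
        weighted_events[treated] += weight * outcome
        weight_total[treated] += weight

    risk_treated = weighted_events[1] / weight_total[1]
    risk_untreated = weighted_events[0] / weight_total[0]
    return risk_treated - risk_untreated

# Synthetic toy cohort: older patients are treated more often and have
# higher baseline risk, so a naive comparison is confounded by age.
records = (
    [("old", 1, 1)] * 3 + [("old", 1, 0)] * 3      # old, treated: risk 0.5
    + [("old", 0, 1)] * 2                          # old, untreated: risk 1.0
    + [("young", 1, 0)] * 2                        # young, treated: risk 0.0
    + [("young", 0, 1)] * 3 + [("young", 0, 0)] * 3  # young, untreated: risk 0.5
)

naive = (sum(o for _, t, o in records if t) / sum(t for _, t, _ in records)
         - sum(o for _, t, o in records if not t)
         / sum(1 - t for _, t, _ in records))
adjusted = iptw_risk_difference(records)
print(f"naive difference: {naive:.3f}, IPTW difference: {adjusted:.3f}")
```

In this toy cohort the risk difference within each age stratum is -0.5, while the naive comparison gives -0.25; the weighted estimate recovers the stratum-level value because the weighting balances the age distribution across the two arms.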

In the second part, we will introduce machine learning and EHR-based risk prediction and highlight critical associated issues, including fairness, interpretability, and generalizability. We will first explore methods to measure and address algorithmic disparities (potential discrimination against certain disadvantaged subpopulations) in risk prediction models. Then, we will discuss the need for interpretability and introduce methods to explain risk prediction models. Lastly, we will introduce how to train a risk prediction model that generalizes better when applied to different populations or datasets. The outline and the associated materials are summarized below:

Tutorial materials and outline

  • Introduction (10 min)
  • Trial Emulation for Generating Real-world Evidence (70 min) [PDF]
    • Randomized Controlled Trial, Trial Emulation, and Machine Learning-driven Trial Emulation for Causal Inference
    • Using EHR and Trial Emulation to understand Long COVID
    • Using EHR and Trial Emulation for Alzheimer’s disease drug repurposing
  • Advancements in Risk Prediction for Healthcare (70 min) [PDF]
    • Machine Learning for Risk Prediction in Health Care
    • Quantifying and Addressing Algorithmic Disparity
    • Explaining Models by Causal Path Decomposition
    • Improving Model Generalizability across Multiple Sites
  • Conclusion, Discussion and QA (20 min)
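As a small illustration of what quantifying algorithmic disparity can look like for a binary risk prediction model, the sketch below computes the true positive rate per subgroup and the gap between the highest and lowest rates (often called the equal opportunity difference). This is one of several possible disparity measures and is not claimed to be the one used in the tutorial; the function names and the toy labels are assumptions made for this example.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that the model flags as positive."""
    flagged = [pred for truth, pred in zip(y_true, y_pred) if truth == 1]
    if not flagged:
        raise ValueError("no positive cases in this group")
    return sum(flagged) / len(flagged)

def equal_opportunity_gap(y_true, y_pred, group):
    """Return per-group true positive rates and their max-min gap."""
    rates = {}
    for g in sorted(set(group)):
        idx = [i for i, member in enumerate(group) if member == g]
        rates[g] = true_positive_rate(
            [y_true[i] for i in idx],
            [y_pred[i] for i in idx],
        )
    return rates, max(rates.values()) - min(rates.values())

# Synthetic toy data: outcomes, binarized model predictions, and a
# hypothetical subgroup label for each patient.
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1]
group = ["A", "A", "A", "A", "A", "B", "B", "B", "B"]

rates, gap = equal_opportunity_gap(y_true, y_pred, group)
print(rates, gap)
```

A gap near zero means the model identifies truly high-risk patients at similar rates across subgroups; a large gap indicates that patients in one subgroup are more likely to be missed.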


This tutorial aims to bridge the gap between methodology (the CS community) and clinical translation (the medical community). We will introduce machine learning tools (e.g., causal inference and predictive modeling) for mining EHRs, tailored to specific healthcare applications. The tutorial is intended to be highly accessible to all data mining researchers, students, and practitioners who are interested in health data science. It is self-contained, and no special prerequisite knowledge is required.


Dr. Chengxi Zang is currently an Instructor in the Department of Population Health Sciences, Weill Medical College of Cornell University. He is also a faculty member in the WCM Institute of AI for Digital Health (AIDH). He received his Ph.D. from Tsinghua University in January 2019, with an Excellent Ph.D. Dissertation Award in the Computer Science Department and an Excellent Ph.D. Award at Tsinghua University. His research interest is AI for healthcare. His current focus is on using AI/machine learning and large-scale real-world health data to generate robust and generalizable real-world evidence, aiming to solve top healthcare challenges including drug repurposing for Alzheimer's disease, understanding Long COVID, and preventing suicide. His research has been published in top medical journals such as Nature Medicine, Nature Communications, Journal of General Internal Medicine, Scientific Reports, Cell Patterns, and Archives of Pathology & Laboratory Medicine, as well as top CS venues including KDD, AAAI, TKDE, and ICDM. His papers were an ICDM'18 Best Paper Candidate and won the Best Paper Award at the AAAI'20 Workshop on Deep Learning on Graphs. His research, algorithms, and code have been applied at companies including Tencent, NVIDIA, and Boehringer Ingelheim, and have received wide media coverage.

Dr. Weishen Pan is currently a postdoctoral research associate in the Department of Population Health Sciences, Weill Cornell Medicine, Cornell University. He received his Ph.D. from Tsinghua University. His primary research interest is machine learning algorithm development in computational medicine, particularly model fairness and interpretability. He has published at top machine learning and data mining conferences including KDD and NeurIPS. His research on explaining algorithmic disparity by causal pathway decomposition was highlighted in the AMIA 2021 Year-in-Review Session. He won the data challenge on PTHrP results prediction organized by AACC as a core team member.

Dr. Fei Wang is currently an Associate Professor of Health Informatics in the Department of Population Health Sciences, Weill Cornell Medicine, Cornell University. His major research interest is data mining and its applications in health data science. He has published more than 200 papers in top venues of related areas such as ICML and KDD. His papers have received over 26,000 citations so far, with an h-index of 78. His papers have won 7 best paper awards at top international conferences on data mining and medical informatics. His team won the championship of the NIPS/Kaggle Challenge on Classification of Clinically Actionable Genetic Mutations in 2017 and the Parkinson's Progression Markers Initiative data challenge organized by the Michael J. Fox Foundation in 2016. Dr. Wang is the recipient of the NSF CAREER Award in 2018, the inaugural research leadership award at the IEEE International Conference on Health Informatics (ICHI) 2019, the Amazon AWS Machine Learning for Research Award in 2017 and 2019, and the Google Faculty Research Award. Dr. Wang’s research has been supported by NSF, NIH, ONR, PCORI, MJFF, AHA, and others. Dr. Wang is the chair of the Knowledge Discovery and Data Mining working group in the American Medical Informatics Association (AMIA).