Abstract
Causal discovery sits at the intersection of statistics, machine learning, and artificial intelligence (AI), leveraging data-driven algorithms to complement domain expertise for constructing causal diagrams that represent complex real-world systems. Causal diagrams help synthesize large volumes of information and support explainable AI (xAI). By building causal models, we can address interventional and counterfactual questions. While causal discovery does not provide conclusive proof of causal relationships, we show how it can be a valuable tool for a range of tasks in biometrics: offering insights for decision-making in areas ranging from medical treatment to policy development, generating new clinical hypotheses, and guiding the design of follow-up studies that explore those hypotheses in greater depth. Despite tremendous methodological advances in recent years, uptake in real-world applications remains limited, partly due to the perceived complexity of the methods and the scarcity of accessible training. This course is designed as a starter kit for researchers and practitioners in any area of biometrics to enter the world of causal discovery and explore the potential of combining expert knowledge with automated methods to build causal models from real-world data.
The course is structured into three sessions. The first session will define the terminology of causal graphs before introducing the fundamentals of causal discovery, including the concept of equivalence classes and the most common constraint- and score-based algorithms, such as PC (Peter and Clark) and GES (Greedy Equivalence Search), along with their underlying assumptions. The second session will focus on state-of-the-art sampling-based Bayesian causal discovery methods, which naturally propagate uncertainty about the causal diagram into downstream results, such as estimated intervention effects. The third session will demonstrate a reproducible workflow that integrates domain expertise with data to find plausible causal diagrams for a data-generating mechanism and draw insights into a health science phenomenon under investigation. For illustration, we will use case studies from epidemiology and clinical research, where we wish to understand the interplay between personal characteristics, symptoms, treatment, risk factors and clinical outcomes. Participants will engage in hands-on exercises throughout the course to reinforce and apply the concepts covered. The course will conclude with a novel application in a real-world clinical setting, using data from the Novartis-Oxford multiple sclerosis database, showcasing how the results provide insights into which biological pathways are likely involved in disease progression. The analytical pipeline will use R packages, including pcalg, BiDAG and Bestie, and presents a complete, reproducible workflow that combines expert knowledge with data to find plausible causal diagrams, simulate interventions, and estimate putative intervention effects.
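To give a flavour of the hands-on sessions, the constraint- and score-based algorithms named above can be run in a few lines with the pcalg package. The sketch below is illustrative only: it uses simulated Gaussian data from a random DAG, and the choices of graph size, edge probability, and significance level alpha are arbitrary assumptions, not course defaults.

```r
library(pcalg)

set.seed(1)
## Simulate data from a random 5-node DAG (illustrative parameters)
true.dag <- randomDAG(5, prob = 0.4)
d <- rmvDAG(1000, true.dag)

## Constraint-based discovery: PC algorithm with Gaussian CI tests
suffStat <- list(C = cor(d), n = nrow(d))
pc.fit <- pc(suffStat, indepTest = gaussCItest,
             alpha = 0.01, labels = as.character(1:5))

## Score-based discovery: GES with a Gaussian BIC-type score
score <- new("GaussL0penObsScore", d)
ges.fit <- ges(score)
```

Both calls return an equivalence class (a CPDAG) rather than a single DAG, which previews the equivalence-class concept covered in the first session.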
Prerequisites
Knowledge of core statistical concepts is expected. Prior exposure to causal inference terminology and central concepts in causal modelling, including diagrams, is beneficial but not mandatory. No previous experience with causal discovery is required. Familiarity with the R statistical software is a prerequisite for effectively engaging in the hands-on tutorial. Participants should bring their own device for the practical activities.
Learning Objectives
Participants will gain hands-on experience applying causal discovery algorithms in R and interpreting their outputs. By the end of the course, attendees will be equipped to carry out a reproducible, end-to-end analysis, from data to inference, using causal diagrams. Specifically, they will be able to:
- Explain the core components of causal discovery and inference through causal diagrams.
- Recognise and understand the key concepts of different approaches: constraint-, score- and (Bayesian) sampling-based methods for causal discovery.
- Implement and compare various algorithms for causal discovery using R.
- Discuss the strengths and limitations of each approach.
- Interpret results to draw insights into the underlying data-generating mechanism. Examples include whether intervening on a variable is likely to affect an outcome, whether there may be confounding in the observed association between an exposure and an outcome, or whether mediators exist on a path from an exposure to an outcome.
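The confounding scenario in the last objective can be illustrated with a few lines of base R. This is a hypothetical toy simulation (variable names C, X, Y are our own): a confounder induces a non-zero association between an exposure and an outcome even though the exposure has no causal effect.

```r
set.seed(42)
n <- 5000
C <- rnorm(n)          # confounder
X <- C + rnorm(n)      # exposure, influenced by the confounder
Y <- 2 * C + rnorm(n)  # outcome, influenced by the confounder only

coef(lm(Y ~ X))["X"]      # markedly non-zero: confounded association
coef(lm(Y ~ X + C))["X"]  # close to zero after adjusting for C
```

A causal discovery algorithm applied to (C, X, Y) would place X and Y in a structure where their dependence is explained via C, which is exactly the kind of insight the interpretation exercises target.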
Spatially referenced data are a common feature of biometric applications, but despite extensive research in this area, several crucial issues still require further attention. With advances in technology, spatial data have become more abundant, allowing diverse data of interest to be collected in the same region and leading to an increased focus on multivariate models.
While methodologies exist in the literature to deal with multiple responses, there is still ample scope for research and improvement, such as a deeper understanding of existing techniques and the development of scalable models to handle the growing volume of observed data. Additionally, some response variables are not normally distributed, so non-Gaussian and discrete processes are essential for realistic modelling. Moreover, spatial data can be measured at different levels, such as points, satellite grids, or political divisions, creating a diverse range of data. Since the scales of these measurement levels are not necessarily aligned, techniques for dealing with spatial misalignment, data fusion, and prediction for areal data are critical for building models that use all available information effectively.
In summary, research directions for spatially referenced data in biometric applications include the development of spatially explicit models that incorporate spatial dependencies and environmental covariates, spatio-temporal modelling, and methods for handling non-normality, diverse measurement levels, and spatial misalignment.
About the Instructors
See above.