Short Course Program

Full-day and Half-day Short Course proposals have been selected for presentation on 12 July, which is day one of the International Biometric Conference.  These courses are taught by experienced professionals who are experts in their fields, so you do not want to miss out! You may choose one or more Short Courses when you register for the conference. 

Short Course Time Frames (tentative):


Full-Day Courses:  9:00 - 18:00 (SC01, SC02, SC03, SC04)


Morning Half-Day Courses:  9:00 - 13:00 (SC05, SC06)


Afternoon Half-Day Courses:  14:00 - 18:00 (SC07, SC08)


Full Day Courses - 12 July 2026

For an additional fee, attendees may add a full-day short course to their registration.

SC01: Data analysis with competing risks and intermediate states

Ronald Geskus; Hein Putter

COURSE INFORMATION

Abstract

The role of competing risks and intermediate states in the analysis of time-to-event data is increasingly acknowledged. However, confusion remains regarding the proper analysis. The most important reason for the confusion is conceptual. What do the different quantities represent, and which assumptions do we need to make for unbiased estimation? For example, does censoring need to be independent? Or do we need to make the Markov assumption in the multi-state setting? Once a model that gives interpretable results has been chosen and the assumptions are satisfied, estimation is relatively straightforward with right censored and/or left truncated data using readily available software. In the morning we cover competing risks analysis and explain how it differs from a marginal analysis in the presence of competing risks. We define the cause-specific hazard and subdistribution hazard and show how they relate to each other and to the cause-specific cumulative incidence. We give three algebraically equivalent estimators of the cause-specific cumulative incidence. Next, we give an overview of regression models for these three quantities and explain their difference in interpretation. We discuss the proper analysis in relation to the type of study question (inferential, predictive or causal).

The afternoon is devoted to multi-state models. We define the transition hazard, which is the extension of the cause-specific hazard to the multi-state setting. It forms the basis for the transition probability and the state occupation probability. We explain the Aalen-Johansen estimator of the transition probability. Then we discuss how to incorporate covariables in regression models for the transition hazards, and how these can be translated to transition probability estimates to obtain dynamic predictions. Throughout, we illustrate how estimation and analyses can be performed in R, using the standard survival package as well as additional packages.

 
Competing risks:
- We give several examples in which competing risks are present. We distinguish between situations in which several event types are of equal interest, situations with one event type of interest while the competing risk is a human intervention, and situations with one event type of interest and the competing risk is a biological event.
- We introduce two approaches to competing risks analysis. One is the multi-state approach, which is based on the cause-specific hazard. The estimator of the cause-specific hazard is the standard rate estimator. It leads to the Aalen-Johansen estimator of the cause-specific cumulative incidence. The other is the subdistribution approach. We define and estimate the subdistribution hazard, which directly relates to the cause-specific cumulative incidence.
- We explain why the Cox model can be used in a competing risks setting, but with a different interpretation. An alternative is the Fine and Gray model, which quantifies the impact of covariables on the subdistribution hazard.
- We contrast marginal analysis and competing risks analysis. We discuss the appropriate approach for inferential, predictive and causal study questions in a competing risk analysis. We briefly discuss the problem of causality in a competing risk setting.

Multi-state models:
- We introduce the most important concepts: transition hazard, transition probability and state occupation probability, and discuss the Markov assumption. We show how transition hazards and probabilities are related to each other, leading to the Aalen-Johansen estimator.
- We discuss how to incorporate covariables in (proportional hazards) regression models for the transition hazards, and introduce transition-specific covariables.
- We explain how, using proportional hazards models for the transition hazards, dynamic prediction probabilities can be obtained for patients with specific covariable values.
- We contrast dynamic prediction using models for the transition hazards with more direct approaches for modeling transition probabilities, like landmarking.
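As a flavour of the R illustrations, here is a minimal sketch of the competing-risks building blocks above using the survival package; the data frame d and its columns time, status (coded 0 = censored, 1 = event of interest, 2 = competing event) and age are hypothetical placeholders rather than course material.

```r
## Minimal competing-risks sketch with the survival package (hypothetical data frame d)
library(survival)

d$status <- factor(d$status, 0:2, labels = c("censored", "event", "competing"))

## Aalen-Johansen estimator of the cause-specific cumulative incidences
aj <- survfit(Surv(time, status) ~ 1, data = d)
summary(aj)

## Cox model for the cause-specific hazard of the event of interest
cs_cox <- coxph(Surv(time, status == "event") ~ age, data = d)

## Fine and Gray model for the subdistribution hazard, via weighted data
fg_data <- finegray(Surv(time, status) ~ ., data = d, etype = "event")
fg_cox  <- coxph(Surv(fgstart, fgstop, fgstatus) ~ age,
                 weights = fgwt, data = fg_data)
```

Dedicated packages such as mstate extend the same ideas to general multi-state models.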


Prerequisites

Some formulas are used, but no mathematical proofs are given. The overall difficulty level of the course is low to moderate. Participants are expected to have a good understanding of the techniques of time-to-event analysis with a single event type. Some experience with the analysis of time-varying covariables is an advantage, but not strictly necessary.

Learning Objectives

At the end of the course participants should:
1. Be able to decide which quantity is best suited to answer a specific research question in a setting with competing risks;
2. Be acquainted with the concepts of cause-specific hazard, subdistribution hazard and cause-specific cumulative incidence;
3. Be able to choose between a Cox model and a Fine and Gray model;
4. Be acquainted with concepts in multi-state models like transition hazards, transition probabilities, state occupation probabilities, the Markov assumption;
5. Understand the relation between transition intensities and transition probabilities, and be acquainted with the Aalen-Johansen estimator;
6. Understand how (transition-specific) covariables can be included in proportional hazard models for the transition hazards;
7. Know how to perform the analysis using statistical software;
8. Be able to correctly interpret the results that are obtained.


Laptop

Recommended.

About the Instructors
More information coming soon.

SC02: Dynamic Predictions for Longitudinal and Time-to-Event Outcomes, with Applications in R

Dimitris Rizopoulos

COURSE INFORMATION

Abstract

This course focuses on data collected in follow-up studies. Outcomes from these studies typically include longitudinally measured responses (e.g., biomarkers) and the time until an event of interest occurs (e.g., death, dropout). The aim is often to utilize longitudinal information to predict the risk of the event. An essential attribute of these predictions is their time-dynamic nature, i.e., they are updated each time new longitudinal measurements are recorded. In this course, we will introduce the framework of joint models for longitudinal and time-to-event data, explaining how it can be used to estimate risk predictions in settings involving a single event and competing risks. Special attention will also be given to the validation of these predictions using discrimination and calibration measures. The course will include two software practical sessions illustrating how such predictions can be calculated and validated using the R package JMbayes2 (https://drizopoulos.github.io/JMbayes2/).

 
- Session 1: Introduction of the framework of joint models for longitudinal and time-to-event data
- Session 2: Joint models for multiple longitudinal outcomes and competing risk event processes
- Session 3: Dynamic predictions for one longitudinal and one event outcome + JMbayes2 practical to fit joint models and estimate risk predictions
- Session 4: Dynamic predictions for multiple longitudinal outcomes and competing risks + JMbayes2 practical to fit joint models and estimate risk predictions
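A minimal sketch of the kind of workflow the practical sessions cover, assuming JMbayes2's jm() and predict() interface; the data frames long_data and surv_data, the variables biomarker, year, id, time, status and age, and the chosen prediction times are all hypothetical.

```r
## Minimal joint-model and dynamic-prediction sketch (hypothetical data and names)
library(JMbayes2)
library(survival)
library(nlme)

## Longitudinal submodel: linear mixed model for the biomarker trajectory
lme_fit <- lme(biomarker ~ year, random = ~ year | id, data = long_data)

## Survival submodel: Cox model on the subject-level data (one row per subject)
cox_fit <- coxph(Surv(time, status) ~ age, data = surv_data)

## Joint model linking the two submodels
joint_fit <- jm(cox_fit, lme_fit, time_var = "year")

## Dynamic prediction: cumulative event risk for one subject, given their history
nd <- long_data[long_data$id == 1, ]
predict(joint_fit, newdata = nd, process = "event",
        times = seq(2, 5, by = 0.5), return_newdata = TRUE)
```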


Prerequisites

The course assumes knowledge of basic statistical concepts, such as standard statistical inference using maximum likelihood and regression models. Additionally, a basic understanding of mixed-effects models and survival analysis would be beneficial. Participants are required to bring their laptops with the battery fully charged. Before the course, instructions will be sent for installing the required software.

Learning Objectives

- Understand the framework of dynamic predictions, including their calculation and validation.
- Be able to define an appropriate joint model to derive dynamic predictions in the motivating setting, and estimate it using JMbayes2.
- Be able to use the estimated model to calculate predictions for different subjects.
- Be able to internally or externally validate the derived predictions.


Laptop

Recommended.

About the Instructors
More information coming soon.

SC03: Multiple Imputation of Multilevel Data

Shahab Jolani

COURSE INFORMATION

Abstract

Multiple imputation (MI) is widely used to address the issue of missing data in practice. However, standard implementations of MI often assume independent data, making them unsuitable for nested or clustered data, such as multicentre studies and individual participant data meta-analysis. Recent developments in methodology (Jolani, 2025; Jolani, 2018; Audigier et al., 2018; Jolani et al., 2015) now enable imputation of multilevel data that effectively preserves the hierarchical structure of the data.

Most attendees of the IBC2026 conference are likely familiar with multiple imputation but may not have explored its application to multilevel data. As multilevel data are increasingly encountered in (bio)medical research, this course addresses a pressing need for robust and practical solutions.

This course describes the difficulties in handling missing data in multilevel settings, notably the challenge of accounting for the multilevel structure of the data and addressing the coexistence of systematically missing data (where a variable is missing for all individuals in a study) and sporadically missing data (where a variable is missing only for some individuals in a study). Participants will gain insight into two main families of imputation methodologies – joint modelling and fully conditional specification (FCS, or chained equations) – along with their respective strengths and limitations.

References:
- Audigier, V., White, I.R., Jolani, S., et al. (2018). Multiple imputation for multilevel data with continuous and binary variables. Statistical Science, 33(2), pp. 160–183.
- Jolani, S. (2025). Hierarchical imputation of categorical variables in the presence of systematically and sporadically missing data. Research Synthesis Methods, pp. 1–29. https://doi.org/10.1017/rsm.2025.10017
- Jolani, S. (2018). Hierarchical imputation of systematically and sporadically missing data: An approximate Bayesian approach using chained equations. Biometrical Journal, 60(2), pp. 333–351.
- Jolani, S., Debray, T., Koffijberg, H., Van Buuren, S., and Moons, K. (2015). Imputation of systematically missing predictors in an individual participant data meta-analysis: A generalized approach using MICE. Statistics in Medicine, 34, pp. 1841–1863.

With a focus on the FCS framework, we provide theoretical background in the morning and discuss one- and two-stage multilevel imputation methods as general FCS imputation approaches for handling missing data in multilevel structures. In the afternoon, we conclude with a hands-on practical session using the popular MICE package in R, where participants will work through a case study. A step-by-step guide will be provided, enabling participants to confidently specify and perform the imputation task.
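To give a concrete feel for FCS imputation of two-level data, here is a minimal sketch assuming mice's make.predictorMatrix()/make.method() helpers, its 2l.pan imputation method, and the convention of flagging the cluster identifier with -2 in the predictor matrix; the data frame dat and its variables y, x and study are hypothetical.

```r
## Minimal two-level FCS imputation sketch with mice (hypothetical data frame dat)
library(mice)

## dat: continuous y and x measured on individuals nested within `study`
dat$study <- as.integer(dat$study)      # cluster identifier as integer

pred <- make.predictorMatrix(dat)
pred[, "study"] <- -2                    # -2 flags the cluster (level-2) identifier
diag(pred) <- 0                          # a variable never predicts itself
meth <- make.method(dat)
meth[c("y", "x")] <- "2l.pan"            # multilevel imputation via the pan model

imp <- mice(dat, method = meth, predictorMatrix = pred, m = 10, seed = 2026)

## Analyse each completed dataset and pool with Rubin's rules
fit <- with(imp, lm(y ~ x))
summary(pool(fit))
```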

Prerequisites

Participants are expected to have a general understanding of multilevel modelling, such as familiarity with concepts like random intercept and random slope models. Additionally, a basic working knowledge of R is required. Prior experience with multiple imputation using the MICE package in R is an advantage but not mandatory.

    Learning Objectives
By the end of the course, participants will be able to:
- Understand the unique challenges of imputing missing data in multilevel settings.
- Identify the strengths and limitations of FCS imputation.
- Distinguish between one-stage and two-stage multilevel imputation methods.
- Apply multilevel imputation methods to their own datasets using the MICE package in R.

    Textbook
    TBD.

    Laptop
    TBD

    About the Instructors

    More information coming soon.


    SC04: Large Language Models: Statistical Foundations, Transformer Architectures, and Applications in Biostatistics

    Paulo Rodrigues

    COURSE INFORMATION

    Abstract

    Large Language Models (LLMs) like BERT, GPT, and BioGPT are revolutionizing how researchers extract, process, and generate text-based information across various scientific domains. In biostatistics and public health, LLMs offer powerful tools to automate literature reviews, analyze electronic health records (EHRs), summarize clinical notes, and support reproducible data workflows. Despite their increasing presence, most LLMs are developed and applied in non-statistical communities, raising critical concerns related to uncertainty quantification, explainability, bias, and ethical use. This course aims to bridge this gap by introducing biometricians and statisticians to the theoretical foundations, computational strategies, and applications of LLMs from a statistical perspective.

     
- Statistical foundations of language models (n-grams, entropy, perplexity)
- Transformer architecture and attention mechanisms
- Overview of BERT, GPT, and pretraining
- Prompt engineering for biomedical applications
- Statistical inference: uncertainty, calibration, interpretability
- Ethical issues, RLHF, fairness, and watermarking of LLM outputs
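As a toy illustration of the first topic, the base-R snippet below estimates unigram probabilities from a made-up training corpus and computes the cross-entropy and perplexity of a test sequence; the corpus and numbers are invented for the example and are not course material.

```r
## Toy unigram language model: entropy and perplexity in base R (invented corpus)
train <- c("the", "model", "predicts", "the", "next", "word", "the", "end")
test  <- c("the", "model", "predicts", "the", "word")

## Maximum-likelihood unigram probabilities from the training corpus
p <- table(train) / length(train)

## Cross-entropy (bits per word) of the test sequence under the model,
## and the corresponding perplexity
ce <- -mean(log2(p[test]))
perplexity <- 2^ce
c(cross_entropy = ce, perplexity = perplexity)
```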


    Prerequisites

    Participants should have:

- Basic knowledge of probability and statistics
- Familiarity with statistical modeling concepts

No prior experience in deep learning or natural language processing is required, but a general interest in modern data science methods will be beneficial.

Optional: Some code demonstrations will use Python, but no programming is required to benefit from the course. Jupyter notebooks will be provided for those who wish to explore further.

    Learning Objectives

By the end of the course, participants will be able to:
1. Understand the statistical foundations of language models, including probabilistic modeling, entropy, and perplexity;
2. Explain the architecture of Transformer-based models such as BERT and GPT, including attention mechanisms and positional encoding;
3. Apply prompt engineering techniques (zero-shot, few-shot, in-context) to guide LLM behavior in practical tasks;
4. Explore applications of LLMs in biostatistics and biomedical research, such as EHR summarization and literature synthesis;
5. Evaluate LLM outputs using statistical inference tools, including calibration, uncertainty, and explainability;
6. Discuss alignment, RLHF, fairness, reproducibility, and ethical considerations related to the use of LLMs in scientific contexts.


    Laptop

    Recommended.

    About the Instructors
    More information coming soon.

    Half Day Courses - 12 July 2026

For an additional fee, attendees may add one or more half-day short courses to their registration.

    SC05: A Hands-On Guide to Precision Medicine: Methods and Applications

    Nikki Freeman; Tarek Zikry; Tianyi Liu

    COURSE INFORMATION

    Abstract
    Precision medicine promises to match the right treatment to the right person at the right time. As evidenced by the 2021 special issue of the Journal of the American Statistical Association titled “Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery,” precision medicine is one of the most prominent and promising emerging methodological fields in biostatistics. Moreover, the number of ongoing clinical trials focused on developing precision medicine evidence in the context of back pain, behavioral health, and diabetes, among others, continues to grow, with over 200 trials in 2022, a near tenfold increase over the 10 years prior. Motivated by the potential of precision medicine, both methodologically and in clinical research, the goal of this course is to introduce participants to contemporary precision medicine: (1) the identification of heterogeneous treatment effects, (2) the construction of individualized treatment rules, and (3) sequential multiple assignment randomized trials (SMARTs). This course will emphasize conceptual understanding, methodological rigor, and practical application, with implementation examples utilizing the R package DynTxRegime included.

     
    - Course and instructor introductions
    - Welcome and overview of the course objectives
    - Introduction of instructors and their areas of expertise
    - Brief outline of the course structure and how modules connect.
- Module 1: Heterogeneous treatment effects
  - Key concepts: Average Treatment Effect (ATE), Conditional ATE (CATE), Individual Treatment Effect (ITE); importance of heterogeneity in treatment responses
  - Methodological approaches: regression-based modeling; causal forests and nonparametric methods; meta-learners (T-learner, S-learner, X-learner)
  - Practical component: case study application computing treatment effects in randomized clinical trial data; hands-on code walkthrough covering empirical implementation and computational considerations
- Module 2: Individualized treatment rules (see the sketch following this outline)
  - Key concepts: Individualized Treatment Rules (ITR), Dynamic Treatment Regimes (DTR), value function; clinical and policy motivations for ITRs; precision medicine and ITRs under resource constraints
  - Methodological approaches: Augmented Inverse Probability Weighted Estimation (AIPWE); value-search methods; classification-based approaches (e.g., outcome-weighted learning); Q-learning and constrained Q-learning
  - Practical component: case study illustrating ITR derivation and real-data application using the R package DynTxRegime; applying ITRs in resource-constrained health settings
- Module 3: Sequential multiple assignment randomized trials (SMARTs)
  - Key concepts: SMART design principles; embedded DTRs, replication, and inverse probability weighting; handling intercurrent events and decision points
  - Methodological approaches: design considerations and common estimands; estimation strategies tailored to sequential decision-making
  - Practical component: case study featuring a SMART dataset; coding exercise to construct, evaluate, and compare embedded regimes
    - Closing and discussion
    - Summary of key takeaways from each module
    - Recommendations on further reading and continued learning resources
    - Q&A with instructors.
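To make Module 2 concrete (as flagged in the outline above), here is a minimal base-R sketch of a single-stage individualized treatment rule obtained by outcome regression, the simplest Q-learning-type approach; the simulated data and variable names are invented, and the DynTxRegime interface used in the course practicals is not reproduced here.

```r
## Single-stage ITR via outcome regression (Q-learning style), base R, simulated data
set.seed(2026)
n <- 500
x <- rnorm(n)                                   # a single patient characteristic
a <- rbinom(n, 1, 0.5)                          # randomized binary treatment
y <- 1 + 0.5 * x + a * (1 - 2 * x) + rnorm(n)   # outcome with effect modification by x

## Q-function: regress the outcome on treatment, covariate, and their interaction
qfit <- lm(y ~ a * x)

## Estimated ITR: treat when the predicted benefit of a = 1 over a = 0 is positive
d1 <- transform(data.frame(x = x), a = 1)
d0 <- transform(data.frame(x = x), a = 0)
benefit <- predict(qfit, d1) - predict(qfit, d0)
itr <- as.numeric(benefit > 0)

## Plug-in estimate of the value of the rule (mean outcome if everyone followed it)
mean(predict(qfit, data.frame(x = x, a = itr)))
```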

    Prerequisites
We recommend that participants have:
- An understanding of concepts in statistical inference, specifically conditional probability, hypothesis testing, and linear models
- Familiarity with causal inference concepts such as potential outcomes, confounding, and common assumptions such as the Stable Unit Treatment Value Assumption (SUTVA)
- Basic experience with R programming: loading packages, data handling, and building linear models

Prior exposure to topics such as propensity score methods, machine learning, or clinical trial design is helpful but not required. Though our case studies will focus on certain biomedical or public health areas, we do not require participants to have an understanding of clinical health applications or disease areas.

    Learning Objectives
Participants of this short course will be able to:
1. Define key precision medicine terminology and understand associated concepts in causal inference, statistical modeling, and machine learning.
2. Describe the foundations of treatment effect heterogeneity from a causal perspective and explore estimation techniques for causal treatment effects.
3. Derive and evaluate treatment rules tailored to individual characteristics for decision-making.
4. Understand the application of precision medicine principles to adaptive interventions, clinical trial strategies, experimental design, and beyond.
5. Gain hands-on experience working with real-world data using appropriate precision medicine methods and the DynTxRegime R package, and understand how to adapt and apply ITRs and other precision medicine methods in low-resource health settings.
    Laptop
    Recommended.
    About the Instructors
    More information coming soon. 

    SC06: Bayesian borrowing in clinical trials: design choices, assessment of operating characteristics and reporting

    Annette Kopp-Schneider; Silvia Calderazzo

    COURSE INFORMATION

    Abstract
Openness and interest towards clinical trial designs that allow incorporation of external information are currently increasing among both sponsors and regulators. Numerous biostatistical approaches are being proposed for implementation. Evaluation and interpretation of such designs are, however, a matter of ongoing discussion (see the EMA workshop: https://www.ema.europa.eu/en/events/workshop-use-bayesian-statistics-clinical-development).

This course offers participants theoretical and practical tools to deepen their understanding of Bayesian clinical trial designs incorporating external information. It will give guidance on how to investigate and communicate design properties. Special focus will be placed on the underlying trade-offs between robustness to heterogeneity between current and external trial information and sample size/power gains. In particular, the course will provide:
(i) an overview of the Bayesian approach and its use in clinical trial hypothesis testing and effect estimation;
(ii) a review of the main Bayesian robust external information borrowing approaches available in this context;
(iii) analytical results and relationships on how information borrowing impacts the operating characteristics of the trial, i.e., its potential advantages as well as risks;
(iv) guidance on simulation studies and graphical reports to improve interpretation and communication transparency of the trial design.

The course will also comprise a practical session where participants will be asked to discuss the implementation of information borrowing and its impact through case studies.

     
    Planned sessions:
    - Introduction to Bayesian thinking (30 min)
    - Phase II clinical trials and how to plan and analyze them under the Bayesian paradigm: advantages, design choices, reporting. (30-45 min)
- Overview of robust priors and robust approaches for dynamic information borrowing (30-45 min)
- Assessment of the trial’s operating characteristics: analytical results, simulation and communication (45-60 min)
    - Practicals: case studies (discussion among participants) (45-60 min)
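As a minimal numerical illustration of borrowing (much simpler than the robust, dynamic approaches covered in the course), the base-R sketch below applies a fixed-weight power prior to a single-arm binary endpoint; all counts, the weight, and the threshold are invented for the example.

```r
## Fixed-weight power prior for a response rate: conjugate Beta-Binomial case (invented numbers)
x0 <- 12; n0 <- 40   # external trial: responders / patients
x  <- 9;  n  <- 25   # current trial:  responders / patients
a0 <- 0.5            # borrowing weight: 0 = no borrowing, 1 = full pooling

## Posterior under a Beta(1, 1) initial prior and power prior weight a0
post_shape1 <- 1 + a0 * x0 + x
post_shape2 <- 1 + a0 * (n0 - x0) + (n - x)

## Posterior probability that the response rate exceeds 30%
1 - pbeta(0.30, post_shape1, post_shape2)
```

Varying a0 between 0 and 1 makes the robustness/efficiency trade-off discussed in the abstract directly visible in this simple setting.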

    Prerequisites
    Pre-requisites are basic knowledge of probability distributions and hypothesis testing concepts.

    Learning Objectives
The participants are expected to be able to implement, report and guide the design choices of a Bayesian clinical trial with efficacy endpoints and information borrowing. They will also be able to communicate potential advantages and disadvantages and to translate trust in external information into quantifiable metrics that can guide the selection and tuning of an appropriate information borrowing approach.

    About the Instructors

    More information coming soon.

    SC07: The 3Rs of Trustworthy Science: Reproducibility, Replicability, and Robustness. Theory and Applications in R.

    Filippo Gambarota; Margherita Calderan

    COURSE INFORMATION

    Abstract
In recent years, the scientific community has become increasingly concerned with the reliability and credibility of published research. Large-scale replication projects have revealed a general pattern of failure to replicate original findings or a systematic overestimation of original effect sizes, in part due to underpowered studies and questionable research and measurement practices. Alongside the growing awareness of the problem and the development of large-scale multilab replication projects, there is a need for proper theoretical and statistical formalization of the issue. The first “R” refers to the computational reproducibility of statistical analyses, which is the first step toward a reliable and trustworthy science. The second “R”, replicability, is best considered from both theoretical and statistical perspectives. How do we define replicability? How can we formalize, assess, and evaluate the results of a replication study? What is the most appropriate statistical measure of replication? Third, we introduce robustness through the lens of multiverse analysis—a systematic approach to evaluating how analytic decisions influence results. These three distinct but interconnected areas are fundamental to reliable science and should be considered, depending on the specific research question, when evaluating results. The aim of the workshop is to provide a theoretical and practical overview of how to define, evaluate, and achieve these three Rs from statistical and computational perspectives.

     
    The course is divided into the three Rs of trustworthy science: Reproducibility, Replicability, and Robustness.
    - Reproducibility: The main focus is on literate programming using Quarto, which combines narrative text with programming code—particularly in R. Both basic and advanced features of Quarto will be introduced, with a particular focus on producing reports and papers. A structured workflow will then be presented, combining Quarto with version control systems and online repositories to create a robust, future-proof, and reproducible scientific environment.
    - Replicability: This section is the core of the course. We will begin with a brief overview of the debate on the replicability crisis in science, followed by the presentation of a unified and formalized framework for replicability using a meta-analytic approach. This includes an introduction to meta-analysis as a statistical tool. We will then provide an overview of key frequentist and Bayesian methods for planning and assessing replication studies, along with their strengths, weaknesses, and critical considerations. Each step will be supported by real and simulated data. In addition, the online book and associated R package at https://github.com/filippogambarota/replicability-book will be used to implement the methods.
    - Robustness: This module will introduce multiverse analysis as a tool for both exploratory and inferential analysis of a single dataset using multiple statistical approaches. We will present the most effective descriptive statistics and graphical tools for this purpose. Particular focus will be given to inferential methods that combine multiple analyses on the same dataset. An overview of these inferential methods, along with their implementation in R, will be discussed.
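As a toy illustration of the multiverse idea in the last module, the base-R sketch below runs the same regression question across a small grid of analytic choices; the data and the two choices (outlier trimming, covariate adjustment) are invented for the example.

```r
## Minimal multiverse sketch in base R: one question, several defensible analytic paths
set.seed(1)
d <- data.frame(y = rnorm(200), x = rnorm(200), z = rnorm(200))
d$y <- d$y + 0.2 * d$x

## The multiverse: outlier handling crossed with covariate adjustment
specs <- expand.grid(trim = c(FALSE, TRUE), adjust = c(FALSE, TRUE))

fit_one <- function(trim, adjust) {
  dd <- if (trim) d[abs(as.vector(scale(d$y))) < 2.5, ] else d
  f  <- if (adjust) y ~ x + z else y ~ x
  coef(summary(lm(f, data = dd)))["x", c("Estimate", "Pr(>|t|)")]
}

## One row per analytic path: estimate and p-value for the effect of x
cbind(specs, t(mapply(fit_one, specs$trim, specs$adjust)))
```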

    Prerequisites
    The course is intended for academic statisticians, applied statisticians, and PhD students or postdoctoral researchers in statistics or biomedical sciences. Participants are expected to have a basic knowledge of statistical data analysis and inference. Some familiarity with meta-analysis methods is also recommended. In addition, participants should have an intermediate level of proficiency in R (e.g., writing functions, importing and manipulating data, and data visualization).

    Learning Objectives
The main objective of the course is to provide participants with a modern theoretical and statistical framework for conducting reproducible, replicable, and robust science. In particular, participants will:
- Understand the current status of the scientific literature in the context of the replicability crisis, with a focus on biomedical sciences, neuroscience, and behavioral sciences.
- Learn best practices for conducting reproducible science using open-source, powerful tools within the R environment.
- Understand how to practically assess a replication study from both theoretical and statistical perspectives.
- Learn how to practically evaluate the robustness of scientific findings using the multiverse analysis approach.

For each module, the most recent and relevant R packages will be presented alongside practical applications.

    Laptop
    Not needed.

    About the Instructors
    More information coming soon.

    SC08: Topological and Object-Oriented Data Analysis

    Moo Chung; Ian Dryden

    COURSE INFORMATION

    Abstract

    The era of big data presents new opportunities for scientific discovery, but valuable patterns often lie in complex structures beyond the reach of traditional statistical tools. Classical methods, grounded in Euclidean assumptions and linearity, struggle to capture the geometric and topological complexity of modern data such as brain networks, genetic pathways, protein structures, and arterial trees. These challenges call for analytic frameworks that accommodate irregular geometries, reveal multiscale structure, and maintain robustness across domains. Topological Data Analysis (TDA) has emerged as a powerful response. By tracking topological changes across scales, TDA reveals persistent features that are robust to noise and invariant under nonlinear transformations. Its core technique, persistent homology, provides coordinate-free summaries that highlight global structure. Now widely used in biomedical imaging, signal processing, and systems biology, TDA is also central to Object-Oriented Data Analysis (OODA), which generalizes statistics to complex data types like shapes and trees. Together, TDA and OODA extend classical methods to the analysis of nonlinear, structured data. This short course will introduce recent theoretical and computational advances in both areas, with emphasis on practical applications to imaging data, biomedical time series, genomic sequences, and protein structures. We aim to demystify topological methods and popularize their use by demonstrating their relevance through hands-on tutorials with real data and open-source R/MATLAB code. The course is timely and necessary given the rising adoption of TDA across biomedical fields, the growing availability of high-dimensional data modalities, and increased interest in scalable, interpretable tools that go beyond conventional models.

     
Two experts will lead this half-day course, each delivering two one-hour sessions—one on theory and one on applications. Dryden will cover the foundations of Object-Oriented Data Analysis (OODA) and its application to biomedical data such as brain artery trees. Chung will introduce Topological Data Analysis (TDA), including persistent homology and its visualizations, and demonstrate TDA applications to EEG, fMRI, network, and genomic data using R/MATLAB.
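As a small taste of the hands-on material, here is a persistent-homology sketch assuming the interface of the R package TDA's ripsDiag(); the noisy circle data are simulated for illustration and are not part of the course datasets.

```r
## Minimal persistent-homology sketch (assumes the TDA package's ripsDiag interface)
library(TDA)

## Noisy sample from a circle: one prominent loop (dimension-1 hole) is expected
set.seed(2026)
theta <- runif(100, 0, 2 * pi)
X <- cbind(cos(theta), sin(theta)) + matrix(rnorm(200, sd = 0.05), ncol = 2)

## Vietoris-Rips persistence diagram up to dimension 1
diag_out <- ripsDiag(X = X, maxdimension = 1, maxscale = 2)
plot(diag_out$diagram)   # the long-lived dimension-1 feature is the circle's loop
```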

     

    Prerequisites

    This course is accessible to participants with a quantitative background. Prior exposure to basic statistics, linear algebra, and calculus—especially matrix notation, probability, and mathematical reasoning—will be helpful. The course is largely self-contained and introduces key mathematical and computational ideas intuitively and with practical motivation. No prior knowledge of topology is assumed.

    Learning Objectives

By the end of the course, participants are expected to:
1. Understand the fundamental concepts and principles underlying Object-Oriented Data Analysis (OODA) and Topological Data Analysis (TDA).
2. Gain familiarity with appropriate statistical procedures for analyzing complex, non-Euclidean data using OODA and TDA frameworks.
3. Apply theoretical concepts to real-world biomedical data, including time series, imaging, network, and genomic data.
4. Develop practical skills in implementing OODA and TDA methods using the provided R and MATLAB code.

    About the Instructors

    More information coming soon.

    Statistics in Practice Session

In keeping with IBC tradition, one complimentary short course is typically offered to all attendees. Congratulations to Prof. Giusi Moffa, who has been honored with the opportunity to present our 2026 Stats in Practice Session! This session is currently scheduled for Tuesday, 14 July (schedule subject to change).

    Implementing causal discovery in epidemiological and clinical research

    Giusi Moffa; Jack Kuipers; Enrico Giudice

    COURSE INFORMATION

    Abstract

Causal discovery sits at the intersection of statistics, machine learning, and artificial intelligence (AI), leveraging data-driven algorithms to complement domain expertise for constructing causal diagrams that represent complex real-world systems. Causal diagrams help synthesize large volumes of information and support explainable AI (xAI). By building causal models, we can address interventional and counterfactual questions. While causal discovery does not provide conclusive proof of causal relationships, we show how it can be a valuable tool for a range of tasks in biometrics, offering insights for decision-making in areas ranging from medical treatment to policy development, generating new clinical hypotheses, and guiding the design of follow-up studies that explore the causal hypotheses in greater depth. Despite tremendous methodological advances in recent years, uptake in real-world applications remains limited, partly due to the perceived complexity of the methods and the scarcity of accessible training. This course is designed as a starter kit for researchers and practitioners in any area of biometrics to enter the world of causal discovery and explore the potential of combining expert knowledge with automated methods to build causal models from real-world data.
     
The course is structured into three sessions. The first session will define the terminology of causal graphs before introducing the fundamentals of causal discovery, including the concept of equivalence classes and the most common constraint- and score-based algorithms, such as PC (Peter and Clark) and GES (Greedy Equivalent Search), along with their underlying assumptions.

The second session will focus on state-of-the-art sampling-based Bayesian causal discovery methods, which naturally account for uncertainty of the causal diagram in their results, such as the estimation of intervention effects.

The third session will demonstrate a reproducible workflow that integrates domain expertise with data to find plausible causal diagrams for a data-generating mechanism and draw insights into a health science phenomenon under investigation. Participants will engage in hands-on exercises throughout the course to reinforce and apply the concepts covered. The session will conclude with a novel application in a real-world clinical setting, using data from the Novartis-Oxford multiple sclerosis database, showcasing how the results provide insights into which biological pathways are likely involved in disease progression.

For illustration, we will use case studies from epidemiology and clinical research, where we wish to understand the interplay between personal characteristics, symptoms, treatment, risk factors and clinical outcomes. The analytical pipeline will use R packages, including pcalg, BiDAG and Bestie. The course presents a complete, reproducible workflow that combines expert knowledge with data to find plausible causal diagrams, simulate interventions, and estimate putative intervention effects.
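As a flavour of the hands-on exercises, the sketch below runs the PC algorithm on a simulated three-variable mechanism, assuming the pcalg interface (pc() with gaussCItest); the data-generating model and variable names are invented for illustration and are not the course case studies.

```r
## Minimal constraint-based causal discovery sketch with pcalg (simulated mechanism)
library(pcalg)

set.seed(2026)
n <- 1000
exposure <- rnorm(n)
mediator <- 0.8 * exposure + rnorm(n)
outcome  <- 0.5 * mediator + rnorm(n)
d <- data.frame(exposure, mediator, outcome)

## PC algorithm with a Gaussian conditional-independence test
suffStat <- list(C = cor(d), n = n)
pc_fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01,
             labels = colnames(d))
plot(pc_fit)   # estimated CPDAG (equivalence class of DAGs); plotting needs Rgraphviz
```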


    Prerequisites

    Knowledge of core statistical concepts is expected. Prior exposure to causal inference terminology and central concepts in causal modelling, including diagrams, is beneficial but not mandatory. No previous experience with causal discovery is required. Familiarity with the R statistical software is a prerequisite for effectively engaging in the hands-on tutorial. Participants should bring their own device for the practical activities.

    Learning Objectives

    Participants will gain hands-on experience applying causal discovery algorithms in R and interpreting their outputs. By the end of the course, attendees will be equipped to carry out a reproducible, end-to-end analysis, from data to inference, using causal diagrams. Specifically, they will be able to:

    - Explain the core components of causal discovery and inference through causal diagrams.

    - Recognise and understand the key concepts of different approaches: constraint-, score- and (Bayesian) sampling-based methods for causal discovery.

    - Implement and compare various algorithms for causal discovery using R.

- Discuss the strengths and limitations of each approach.

- Interpret results to draw insights into the underlying data-generating mechanism. Examples include whether intervening on a variable is likely to affect an outcome, whether there may be confounding in the observed association between an exposure and an outcome, or whether mediators exist on a path from an exposure to an outcome.


    About the Instructors
    See above.