About this document
This text corresponds to the slides of Module 1 of the Simulation-based inference workshop held in 2021. Head over to the workshop page for the rest of the content.
🄯 Álvaro Tejero-Cantero for all the text, licensed under a CC-BY-SA license.
ℹ️ Practical parts are indicated in red background. Speaker's notes are under ▸ §.
This is the initial list of learning goals we set ourselves for this introduction.
After this module, the learner should
- understand the problem that SBI tries to solve,
- have seen practical examples of a number of simulators,
- understand ABC and its limitations, and
- know how (neural) density estimation comes up in SBI.
1. Simulators
- understand the role of simulators in science and their limitations as a modelling paradigm
- Be able to identify a variety of simulation approaches, and categorize precisely (ODE, ABM...) their simulator of interest
- Recognize the need to represent stochasticity and identify ways in which random numbers can appear in models
2. Parameter inference
- Understand what an inverse problem is and why functions are not the right paradigm to represent its solution.
- Know examples and reflect about the interpretation of multiple parameter sets being compatible with a given observation.
3. Bayesian inference w/o likelihoods
- List non-Bayesian approaches to the inverse problem: grid search / brute force, genetic algorithms, MLE and MAP (optimization).
- Understand how Bayesian posterior inference provides a framework for the inverse problem
- Be able to articulate how this approach provides a basis for a decision-theoretic approach to simulation as part of a process.
- Identify sources of intractability for likelihoods that call for simulation-based inference.
- Be able to recognise scenarios of black-box inference; conversely, identify the availability of additional information that may make other approaches ($\neq$SBI) possible/desirable.
- Describe what ABC is and what is problematic about it.
- Define what summary statistics are.
- Simulators for science
- Models in science
- Simple pendulum
- Simulators, everywhere
- Simulators galore
- Agent-based models
- Differential-equation-based models
- Simulator example: core-collapse supernova models
- Epistemic ambition of simulators
- Interlude: introduce your simulator
- Simulators vs. statistical models
- Models, and models
- Example: modelling counts vs. modelling processes
- Induction vs. deduction, an age-old dichotomy
- Inference on simulators
- An inverse problem
- Multiple and uncertain parameter sets
- Bayesian parameter inference
- Introducing the posterior
- Interlude: Bayesian inference
- The Bayesian workflow
- Bayesian inference
- Simulators as statistical models
- Stochasticity and simulators
- Simulators as generative models
- Anatomy of a simulator: notation
- Some notes about the parameters $\theta$ → latents $z_{1:L}$ → outputs $x$
- Latent-path intractability
- Partial observability in the field
- Ways to make likelihoods tractable
- Levels of simulator access
- Simulation-based inference
- LFI, ABC, SBI, FBI, CIA, WHO?
- Interlude: from ABC to SBI
- Conditional density estimation
- Amortized SBI is density estimation
- Advantages derived from using networks
- Further materials
- Resources
Simulators for science
Models in science
- making models is part of the scientific method
- models capture only some aspects of reality
- when formalized, they enable quantitative, testable hypotheses
- model functionalities
- prediction — to support decisions
- understanding — to select interventions
- the structure that doesn't change is the model
- the malleable parts are the parameters
- parameters are 'tuned' based on observations
- multiple input parameter sets can lead to the same output prediction
- equifinality, degeneracy are key to resilience, homeostasis of complex systems
Simple pendulum
$\frac{{\rm d}^2 \theta}{{\rm d} t^2} + \frac{g}{\ell}\, \sin\theta = 0$
- Can predict angles $\theta(t)$ given $g/\ell$ and $\theta(0)$.
- Can infer $g/\ell$ from measured $\theta(t)$.
- for small amplitudes $\sin\theta\simeq\theta$ and timing one oscillation $T=2\pi \sqrt{\ell/g}$ approximately suffices to infer $g/\ell$ → $T$ summarises $\theta(t)$ for inference.
- And extract understanding w.r.t. interventions and counterfactuals.
How many slides does it take a physicist to say harmonic oscillator?
No friction, rigid massless cord, massless bob, 2D, uniform g, fixed support:
- $T$ measured directly or estimated from traces $\theta(t)$.
- model affords understanding (role of $g,\ell$, insight on the dynamics, e.g. conservation laws)
- via analytical manipulations or,
- via numerical integration and post-hoc analysis of results.
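The small-amplitude argument above can be checked numerically. The following is a minimal sketch, not the workshop's own code: the value $g/\ell = 9.81\,{\rm s}^{-2}$, the semi-implicit Euler integrator and the zero-crossing estimate of $T$ are all illustrative assumptions.

```python
import math

def simulate_pendulum(g_over_l, theta0, dt=1e-4, t_max=10.0):
    """Integrate theta'' + (g/l) sin(theta) = 0 with semi-implicit Euler."""
    theta, omega = theta0, 0.0  # released at rest from angle theta0
    trace = []
    for i in range(int(t_max / dt)):
        omega -= g_over_l * math.sin(theta) * dt  # update angular velocity first
        theta += omega * dt                        # then the angle
        trace.append(((i + 1) * dt, theta))
    return trace

def estimate_period(trace):
    """Summary statistic T: mean time between successive downward zero crossings."""
    crossings = []
    for (t0, th0), (t1, th1) in zip(trace, trace[1:]):
        if th0 > 0.0 >= th1:
            # linear interpolation of the crossing time between two samples
            crossings.append(t0 + (t1 - t0) * th0 / (th0 - th1))
    gaps = [b - a for a, b in zip(crossings, crossings[1:])]
    return sum(gaps) / len(gaps)

true_g_over_l = 9.81                                     # assumed value, in 1/s^2
trace = simulate_pendulum(true_g_over_l, theta0=0.05)    # forward problem
T = estimate_period(trace)                               # summarise theta(t) by T
g_over_l_hat = (2 * math.pi / T) ** 2                    # invert T = 2*pi*sqrt(l/g)
print(f"T = {T:.4f} s, inferred g/l = {g_over_l_hat:.3f} 1/s^2")
```

With a larger `theta0` the small-angle formula becomes visibly biased, which illustrates why $T$ only approximately suffices for inference.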
Simulators, everywhere
- 20th-century: explosion of digital, expressive simulators → "complex science"
- "simulator as numerical solver" for an explicit model (e.g. PDEs) — based on discretization
- "simulator as defined by code", an implicit model built from individual interaction rules
- and anything in-between, e.g. pulse-coupled NNs: discrete + continuous dynamics
Simulation is possible (human, analog) without (digital) computers, but not practical.
- Developments in hardware and software are driving factors of simulation-based research
Simulators galore
But what are simulators?
- simulate - /ˈsɪm·jəˌleɪt/ (verb). (Cambridge English dictionary)
- to produce a situation or event that seems real but is not real, especially in order to help people learn how to deal with such situations or events
- simulation might be a key ingredient of cognitive processing? cf. predictive coding. (imagination, speculation, platonic shadow)
- typology: continuous vs. discrete (regular vs. event-based), dynamic vs. steady-state, deterministic vs. stochastic...
- Let's look at a couple of examples
1. create conditions or processes similar to something that exists.
" (...) the brain encodes top-down generative models at various temporal and spatial scales in order to predict and effectively suppress sensory inputs rising up from lower levels. A comparison between predictions (priors) and sensory input (likelihood) yields a difference measure (e.g. prediction error, free energy, or surprise) which, if it is sufficiently large beyond the levels of expected statistical noise, will cause the generative model to update so that it better predicts sensory input in the future." (https://en.wikipedia.org/wiki/Predictive_coding)
Agent-based models
Microscale models "defined by code". Agents represented directly, not by density or concentration; possess internal state, interaction rules, and learning processes that determine state updates; they live in a topology embedded in an environment.
Other examples: culture modelling, tumor growth, epidemics, ecology, traffic, percolation (oil through soil), voting...
See cellular automata handout by Matthew McAuley.
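To make "defined by code" concrete, here is a minimal sketch of an agent-based epidemic. Everything in it is an illustrative assumption rather than a model from the workshop: the function name `run_abm`, the fully mixed population (no spatial topology), and the per-step infection and recovery probabilities.

```python
import random

def run_abm(n_agents=200, n_steps=50, p_infect=0.3, p_recover=0.1, seed=0):
    """Each agent carries an internal state: 'S'usceptible, 'I'nfected or 'R'ecovered."""
    rng = random.Random(seed)
    state = ["S"] * n_agents
    state[0] = "I"  # a single initial case
    history = []
    for _ in range(n_steps):
        new_state = list(state)  # synchronous update: rules read the old state
        for i, s in enumerate(state):
            if s != "I":
                continue
            j = rng.randrange(n_agents)  # interaction rule: contact one random agent
            if state[j] == "S" and rng.random() < p_infect:
                new_state[j] = "I"
            if rng.random() < p_recover:  # state-update rule: recovery
                new_state[i] = "R"
        state = new_state
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history

history = run_abm()
print(history[-1])  # (S, I, R) counts after the last step
```

The model has no closed-form equations: the population-level dynamics only emerge from running the individual rules.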
Differential-equation-based models
- continuous models for dynamically changing phenomena
- relate system response to infinitesimal changes of state variable(s)
- classes of differential equation (DE):
- ODE, ordinary DE: one independent variable $x$, $F\left (x,y,y',\ldots, y^{(n-1)} \right )=y^{(n)}$
- PDE, partial DE: multiple independent variables, e.g. time and space
- SDE, stochastic DE: DE with stochastic fluctuations (→ stochastic process)
- Solution via integration is rarely possible in closed form → numerical approximations: discretization.
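As a hedged illustration of discretization, the sketch below applies the forward Euler scheme to the decay equation $y' = -k\,y$, chosen here (as an assumption) only because its closed-form solution $y(t) = e^{-kt}$ allows the numerical error to be measured.

```python
import math

def euler(f, y0, t_end, n_steps):
    """Forward Euler: replace the derivative by a finite difference of size dt."""
    dt = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += dt * f(t, y)
        t += dt
    return y

k = 2.0                       # assumed rate constant
exact = math.exp(-k * 1.0)    # closed-form solution at t = 1
errors = {
    n: abs(euler(lambda t, y: -k * y, y0=1.0, t_end=1.0, n_steps=n) - exact)
    for n in (10, 100, 1000)
}
for n, err in errors.items():
    print(f"n_steps = {n:5d}   |error| = {err:.2e}")
```

The error shrinks roughly in proportion to the step size, the first-order behaviour expected of forward Euler.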
Simulator example: core-collapse supernova models
Goal. Test understanding of physics. What are the mechanisms that make SN explode?
Model. Euler eqs. + eq. of state + conservation of mass (1), momentum (2), energy (3)
Observables. lightcurves, neutrinos, EM spectrum (🤯 SN remnants visible for centuries, but relevant dynamics on $\lesssim 1\,{\rm s}$)
Solution.
- PDEs → Finite Elements Model (FEM)
- "parameter-free"
- months on high-performance computers
Supernova simulation 400ms post-bounce, courtesy Thomas Janka →
For ABM-DE comparisons see Breitenecker's PhD thesis (2014) and Figueredo et al. (2014).
Epistemic ambition of simulators
🖍️ Theory building — focus on process
- Formalizing heuristics for understanding, running simulator to generate hypotheses.
- Use of reduced models to understand dynamical degrees of freedom and interactions.
🖥️ Hypothesis testing — focus on result
Expected to quantitatively reproduce empirical measurements. Our workshop.
Problem: model misspecification
"Explanation requires reduction", all models are misspecified.
Parsimony (modelling "from the null up", Premo, 2007) vs. omitted variable bias.
It's difficult to know whether lack of fit is due to misspecification or bad inference.
Model comparison out of this workshop's scope. Bayes factors problematic.
In Explaining the Past with ABM: On Modelling Philosophy, by Mark W. Lake in Agent-based Modeling and Simulation in Archaeology Ed Reschreiter, Wurzer, Kowarik.
Learning by Simulation Requires Experimentation not Just Emulation
The most informative models are generative ("What I cannot create, I do not understand. - Feynman 1988")
Explanatory Power Trades Complexity Against Fit.
Interlude: introduce your simulator
Share your simulator-based research with the group. You can re-use slides you have. Paste them here in this Google presentation.
- Key features of your use case (use up to 2 minutes):
- why: phenomenon and scientific meaning of parameters and observations. Do you expect degeneracies / equifinality? what would they mean?
- what: type, dimensionality and structure of parameters and observations, sources of noise
- how: type of simulator (ODE, ABM...), programming language
- when: what is the approximate runtime?
- questions from the audience (about 1m).
Simulators vs. statistical models
Models, and models
Simulations are models (often stochastic), but what about 'statistical models'? What are the differences?
Models can either be mechanistic or convenient (Ripley 1987, Stochastic Simulation p3).
Statistical models
"Accommodate and fit data"
Statistical relationship, correlation
Interpretable post-hoc, if at all
Poor performance outside of training set
Less opinionated — needs much data
Built for inference ('fit')
Mechanistic models (many simulators)
"Formalize principles and hypotheses".
I→O mechanistic link, causal if time involved
Interpretable by design
Extrapolate well (if validated)
More opinionated — works with little data
Inference ❓
Provisional terminology: forward models $\simeq$ mechanistic models $\simeq$ simulators. Simulators do not need to be mechanistic, but most often are.
Interpretable parameters have physical units and known interactions with other aspects of reality. But simulator parameters can also be meaningful abstractions (e.g. 'utility' in a microeconomic simulator). They still have a mechanistic role, stipulated by the relations they obey with other quantities in the simulator, but they represent concepts, operationalizations or aggregates and cannot be traced down to more primitive laws.
Mechanistic models have strong causality properties as long as they explicitly model time (the independent variable par excellence). Time-ordering is a proxy for causality (succession $\not \Rightarrow$ causality, but ¬ succession $\Rightarrow$ ¬ causality). For causal models see Schölkopf et al. (2021). See also Baker et al. 2018 (https://royalsocietypublishing.org/doi/pdf/10.1098/rsbl.2017.0660).
Nothing is black and white: models can be written entirely as stochastic statistical models on individuals (especially relevant for low counts), and inference can be made analytical, with much work.
⚠️ In older stats literature, stochastic simulation connotes Monte Carlo methods only.
Example: modelling counts vs. modelling processes
RNN model of infection counts
Designed to fit data and predict
- RNN flexible approximator,
- specific architecture to fit time correlations
- many, $\mathcal{O}(10^4)$, opaque parameters
- model hard to debug and interpret
- data-hungry,
- poor extrapolation
Epidemic compartmental model
Designed for insight
- ODE, limited modelling repertoire
- use rate constants for population kinetics
- few, $\mathcal{O}(1)$ interpretable parameters
- expresses interaction rules, conservation laws, smoothness constraints.
- misspecification risk, but know how to formulate hypotheses as compartments
Different expectations (predictive performance vs. interpretability) lead to different designs:
- Statistical models (esp. NNs): design driven by the number and type of parameters needed to accommodate the data (scales with the data).
- Mechanistic models: design driven by the number and type of parameters that fit the hypotheses (scales with the hypotheses) and that I can interpret either before or after fitting the model (scales with interpretive power).
- https://datascience.stackexchange.com/questions/10615/number-of-parameters-in-an-lstm-model
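The compartmental side of the comparison can be sketched in a few lines. This is an assumed textbook SIR model with hypothetical rate constants `beta` and `gamma`, integrated with forward Euler; it is not taken from the workshop material.

```python
def simulate_sir(beta, gamma, s0=0.99, i0=0.01, r0=0.0, dt=0.01, t_max=100.0):
    """SIR model on population fractions: S -> I at rate beta*S*I, I -> R at rate gamma*I."""
    s, i, r = s0, i0, r0
    traj = [(0.0, s, i, r)]
    for step in range(1, int(round(t_max / dt)) + 1):
        new_infections = beta * s * i * dt   # interaction rule (mass action)
        new_recoveries = gamma * i * dt      # first-order kinetics
        s, i, r = s - new_infections, i + new_infections - new_recoveries, r + new_recoveries
        traj.append((step * dt, s, i, r))
    return traj

# Two interpretable parameters (assumed values, in 1/day).
traj = simulate_sir(beta=0.3, gamma=0.1)
peak_i = max(i for _, _, i, _ in traj)
print(f"peak infected fraction: {peak_i:.3f}")
```

Each term moves mass between compartments, so $S + I + R$ is conserved by construction: an example of a constraint that the mechanistic model encodes and that a generic RNN would have to learn from data.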
Induction vs. deduction, an age-old dichotomy
Data reduction vs. theory making. Philosophy of science as a guide.
- Induction $\sim$ statistical models: from experimental data, infer the simplest (usually statistical) laws that account best for the data.
- Deduction $\sim$ mechanistic models: from first principles, obtain laws that produce predictions in specific settings. Experimental design seeks measurements that falsify the laws.
Cartoonish duality: there are crossovers between mechanistic and statistical models (e.g. Neural ODEs). No principles without experiments (sense data), and likewise no experimental data without a physical theory of measurement, and perhaps even a model of cognition (cf. Criticism and the Growth of Knowledge, Ed. Lakatos & Musgrave, 1970).
Inference on simulators
An inverse problem
- from parameters to measurements: modelling, simulation: forward problem
- from measurements to parameters: parameter estimation: inverse problem
- point estimates via optimization
- grid search (not scalable)
- derivative-based optimization (gradients generally not available)
- population methods: evolutionary strategies
- shared limitation: no principled ranking of point estimates
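A point estimate by grid search might look as follows. This is a sketch under stated assumptions: the one-parameter toy simulator `sim` (the small-angle pendulum period), the observed period `x_o = 2.0` s and the grid bounds are all hypothetical.

```python
import math

def sim(g_over_l):
    """Deterministic toy simulator: small-angle pendulum period for a given g/l."""
    return 2 * math.pi / math.sqrt(g_over_l)

x_o = 2.0                                          # assumed observed period, in s
grid = [1.0 + 0.01 * k for k in range(2000)]       # candidate g/l values from 1.00 to 20.99
theta_hat = min(grid, key=lambda th: (sim(th) - x_o) ** 2)
print(f"grid-search estimate of g/l: {theta_hat:.2f}")

# Cost of the same resolution in D dimensions: points_per_dim ** D simulator runs.
n_evals_10d = len(grid) ** 10
print(f"simulator runs needed for 10 parameters: {n_evals_10d:.1e}")
```

The result is a single number with no statement about how well neighbouring values would also explain `x_o`, and the cost grows exponentially with the number of parameters.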
Multiple and uncertain parameter sets
- functions destroy information when they are non-injective (underdetermination, equifinality)
- $\hat{\theta}$ replicates observation $x_{\rm o}$, i.e. $x_{\rm o} = \operatorname{sim}(\hat{\theta})$. But what about $\operatorname{sim}(\hat{\theta} + \epsilon)$?
- this could be a serious problem
- sensitivity analysis is the discipline dealing with the $\operatorname{sim} (\hat{\theta}+\epsilon)$ problem
- solution? estimate $\hat{\theta}(x)$, calculate confidence sets
- uncertainty estimate ✔️
- find the multiple $\hat{\theta}$ which could have generated $x_{\rm o}$ (and rank them) ❌
- The "identification" problem (as inference)
- system identification tends to apply to dynamical systems and mixes the finding out of empirical dynamical laws with the fitting of its parameters from data.
- Systems often not even structurally identifiable? https://hal.archives-ouvertes.fr/hal-02995562
Bayesian parameter inference
uncertainty: getting $\theta$ might be only one step in an inference or decision pipeline.
degeneracy: ignoring alternate $\theta$ leading to the same $x_{\rm o}$ discards alternative explanations upfront.
A full posterior parameter distribution conditioned on the observation, $p(\theta|x_{\rm o})$ provides both.
- But how to get there?
- And what to do with them?
simulator data + observation + inference 🪄 ($\sim$day 2 of the workshop)→ posterior
posterior + validation / interpretation ($\sim$ day 3 of the workshop) 🪄 → hypotheses, results
Introducing the posterior
For $\theta \in \mathbb{R}^{D>2}$ we need strategies for representation. Ground truth $\color{orange}{\bullet}\equiv\{\color{orange}{\theta_1, \theta_2}\}$
- marginal posteriors in 2D 🔨 integrate out $D-2$ parameter dimensions
$p(\theta_1, \theta_2|x_{\rm o})=\color{green}{\int} p(\theta_1,\theta_2, \color{green}{\theta_{3:31}}|x_{\rm o})\ \color{green}{{\rm d}\theta_{3:31}}$
typically more like a blob!
Posterior plots courtesy Michael Deistler →
- conditioned posteriors in two dimensions
🏖️ fix $D - 2$ parameter dimensions
$p(\theta_1,\theta_2|x_{\rm o}, \color{green}{\theta_{3:31}\leftarrow\theta^\ast})$
typically sharper!
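Why marginals look like blobs while conditionals look sharper can be seen in a toy case. The sketch below assumes a two-parameter Gaussian "posterior" with correlation `rho = 0.9` (a stand-in, not a posterior from the workshop) and compares the spread of $\theta_1$ after marginalizing over $\theta_2$ with its spread after fixing $\theta_2 \approx \theta^\ast$.

```python
import math
import random
import statistics

rng = random.Random(0)
rho = 0.9  # assumed posterior correlation between theta_1 and theta_2

# Draw samples from a correlated bivariate Gaussian with unit variances.
samples = []
for _ in range(20000):
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    samples.append((a, rho * a + math.sqrt(1 - rho ** 2) * b))

# Marginal: ignore theta_2 altogether (integrate it out).
marg_std = statistics.stdev(t1 for t1, _ in samples)

# Conditional: keep only samples whose theta_2 lies in a thin slab around theta_star.
theta_star, half_width = 0.0, 0.05
cond_std = statistics.stdev(t1 for t1, t2 in samples if abs(t2 - theta_star) < half_width)

print(f"marginal std of theta_1:    {marg_std:.3f}")
print(f"conditional std of theta_1: {cond_std:.3f}")
```

For this Gaussian the conditional standard deviation is $\sqrt{1-\rho^2}$ times the marginal one, so the stronger the correlation between parameters, the larger the gap between the two views.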
Interlude: Bayesian inference
The Bayesian workflow
Assume we want to know $\theta$ and have measured data on $x$.
- Modeling (day 1): formulate joint pdf $p(x,\theta)$
- product rule: $p(x,\theta) = p(x|\theta) p(\theta)=p(\theta|x)p(x)$.
- prior $p(\theta)$ summarizes our knowledge of the parameters (e.g. 'must be positive')
- likelihood $p(x|\theta)$ embodies our knowledge of how $x$ is generated from $\theta$ (→ model)
- rinse and repeat (update the prior: the posterior is the new prior)
- Inference (day 2): Use the data $x_{1:N}$ to learn about the target variable $\theta$ by conditioning
- Typically, use $x_{1:N}$ to fix $\phi$ in $p_\phi(\theta|x)$
- Multiple computational approaches, depending on the problem
- Validation and interpretation (day 3)
- check internal consistency of the posterior (SBC, posterior-predictive checks)
- check consistency with domain knowledge
See: Bayesian Workflow, Gelman et al. 2020 (https://arxiv.org/abs/2011.01808).
Bayesian inference
$\color{0B6E99}{p(\theta|x)} =\frac{ \color{E03E3E}{p (x|\theta)}\color{6940A5}{p(\theta)}}{\color{9B9A97}{\int_\theta p(x|\theta)\ p(\theta)\,{\rm d}\theta}}$
The likelihood (model) times the prior divided by the evidence yields the posterior.
- For very few models this equation yields a posterior in closed form.
- a closed-form posterior can be sampled from and evaluated. Ideal case!
- Usually the evidence integral is hard to compute
- Markov-Chain Monte-Carlo (stochastic, asymptotically exact), works with unnormalized joint. Implicit $\color{0B6E99}{p(\theta|x)}$: low bias / high-variance (slow). Only samples!
- Variational inference (approximate, deterministic). Assume explicit parametric posterior, fast.
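The contrast between a closed-form posterior and MCMC on the unnormalized joint can be sketched with a coin-flip model. The model choice (Bernoulli likelihood, flat Beta(1, 1) prior), the data (`k = 7` successes in `n = 10` trials) and the random-walk step size are assumptions made for illustration.

```python
import math
import random

def log_joint(theta, k, n):
    """Unnormalized log p(x, theta): Bernoulli likelihood times a flat Beta(1, 1) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def metropolis(k, n, n_steps=20000, step=0.3, seed=0):
    """Random-walk Metropolis: only ratios of the joint are needed, never the evidence."""
    rng = random.Random(seed)
    theta = 0.5
    lp = log_joint(theta, k, n)
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        lp_prop = log_joint(proposal, k, n)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):  # accept/reject
            theta, lp = proposal, lp_prop
        chain.append(theta)
    return chain

k, n = 7, 10                                   # assumed data
exact_mean = (1 + k) / (2 + n)                 # closed form: posterior is Beta(1 + k, 1 + n - k)
chain = metropolis(k, n)
mcmc_mean = sum(chain[2000:]) / len(chain[2000:])  # discard burn-in
print(f"closed-form mean = {exact_mean:.3f}, MCMC mean = {mcmc_mean:.3f}")
```

MCMC returns samples only. Evaluating the posterior density at a given $\theta$ would require the evidence, which the sampler never computes.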
Simulators as statistical models
Stochasticity and simulators
- Typically stochasticity models all the processes where we lack specific mechanistic hypotheses
- unless the process is itself stochastic at the physical level, e.g. $\beta$ decay in a nucleus.
- What are sources of stochasticity?
- unobserved latent variables
- stochastic program paths
- instrument noise (aleatoric)
- numerical approximations (→ probabilistic numerics)
→ Outputs must then be stochastic themselves - random variables → probabilities!
Since simulators have a stochastic component, could we treat simulators as statistical models?
Note: numerical approximations deterministic but uncertain due to discreteness, limited compute.
Simulators as generative models
$\color{0B6E99}{p(\theta|x)} =\frac{ \color{E03E3E}{p (x|\theta)}\color{6940A5}{p(\theta)}}{\color{9B9A97}{\int_\theta p(x|\theta)\ p(\theta)\,{\rm d}\theta}}$
- $\color{E03E3E}{p(x|\theta)}$ usually cannot be evaluated for simulators, as they are not built for inference.
- generative models build a probability density over samples — often an implicit one! (e.g. GANs)
- simulators are generative models with implicit likelihood. Most general scenario for inference - inference just from samples - likelihood-free inference (LFI).
- let's look at simulators from a probabilistic perspective
Main concern when writing simulators is to represent mechanistic processes accurately, not inferential tractability. Producing samples is trivial (= running) but evaluating their probabilities is often intractable.
Constructing an ersatz likelihood with classical density estimation (KDE/histograms) is possible but useless for interesting cases (curse of dimensionality).
Anatomy of a simulator: notation
Probabilistic programs are programs with stochasticity that are interpreted as statistical models. In this sense, simulators are probabilistic programs, computer programs that
- take parameter vector as input, $\theta$ — we treat $\theta$ as a random variable, and assign it prior $p(\theta)$
- compute internal states — latent variables $z_\ell \sim p_\ell(z_\ell|\theta,z_{<\ell})$
- produce result vectors $x$ comparable with experimental observations $x_{\rm o}\sim p(x|\theta_{\rm true})$
Since simulators often have a stochastic component and often no explicit functional form, we can borrow 'sampling' notation, $x \sim p(x|\theta,z)$.
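The notation above maps directly onto code. The sketch below is a hypothetical toy simulator (a drifting random walk observed with noise), written only to show where $\theta$, $z_\ell$ and $x$ appear in a program; the distributions are assumptions.

```python
import random
import statistics

def simulator(theta, rng, n_latents=5):
    """Toy probabilistic program: theta -> latents z_1..z_L -> output x."""
    z = []
    prev = 0.0
    for _ in range(n_latents):
        prev = rng.gauss(prev + theta, 1.0)  # z_l ~ p_l(z_l | theta, z_<l)
        z.append(prev)
    x = rng.gauss(z[-1], 0.1)                # x ~ p(x | theta, z)
    return x, z

rng = random.Random(0)
theta = rng.uniform(-1.0, 1.0)               # theta ~ p(theta), an assumed uniform prior
x, z = simulator(theta, rng)
print(f"theta = {theta:.3f}, x = {x:.3f}")

# Running the simulator is cheap; here we check that x is centred on L * theta.
xs_fixed = [simulator(1.0, rng)[0] for _ in range(2000)]
mean_x = statistics.fmean(xs_fixed)
print(f"mean of x at theta = 1: {mean_x:.2f}")
```

In a real simulator the latents `z` are usually discarded or inaccessible. Returning them here only makes the structure visible.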
Some notes about the parameters $\theta$ → latents $z_{1:L}$ → outputs $x$
- $\theta$ fixed dimensionality, typically no structure
- $x$ can be structured (e.g. images, graphs...), and high-dimensional
- $z$ correspond to meaningful states, but are typically unobservable.
- continuous or discrete, changing dimensionality (even during simulation)
- updated deterministically, or stochastically
- we will NOT infer $z$ here
- but the existence of this unobserved state is problematic...
Often $\theta$ is a mix of variegated parameters. Structured $\theta$ is possible but unusual → camera model. The prior for $\theta$ is usually explicit; there are methods that can work with an implicit prior (just samples). Sampling efficiency can improve drastically with access to simulator internals: gradients and/or probabilities conditioned on latents. See Cranmer et al. 2020 and the "Mining gold" papers (G. Louppe's lab). Sometimes $z$ is accessible, sometimes the simulator is a black box. AD has made great progress in differentiating through randomly decided control flow.
Latent-path intractability
Let's talk about $\color{E03E3E}{p(x|\theta)}$. How can a simulator become intractable, beyond lack of normalization?
We seek likelihood-free techniques to free us from latent-path intractability; ideally they'd also solve the normalization problem.
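In the notation of the simulator anatomy ($z_\ell \sim p_\ell(z_\ell|\theta,z_{<\ell})$, $x \sim p(x|\theta,z)$), latent-path intractability can be written out as follows. This is the standard marginalization over latents, given here as a sketch rather than as the formula from the original slide.

```latex
p(x|\theta)
  = \int p(x, z_{1:L}|\theta)\,{\rm d}z_{1:L}
  = \int p(x|\theta, z_{1:L}) \prod_{\ell=1}^{L} p_\ell(z_\ell|\theta, z_{<\ell})\,{\rm d}z_{1:L}
```

Each factor in the integrand may be easy to evaluate for one given latent path, but the integral runs over every path the program could have taken, which is typically infeasible for long or branching simulations.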
Partial observability in the field
Another name is partial observability. Here's a case from epidemiology (Kulkarni et al. 2021, discrete compartmental model)
Ways to make likelihoods tractable
- model reduction / coarsening. different model / limits expressivity, model-specific ❌
- likelihood data augmentation. model-specific, iterative ❌
- martingales. elegant ✔️ few specific models, hard ❌
- don't? likelihood-free inference: generally applicable, can target posterior✔️ performance❓, cutting edge❓
(...) introduce additional parameters $\psi$, which represent missing data, in such a way that the likelihood $p(x,\psi|\theta)$ is tractable. Inference then proceeds by estimating both $\theta$ and $\psi$, typically via EM or MCMC. (after O'Neill 2010)
⚠️ Not an expert here ☠️. Consult data augmentation (DA) literature: O'Neill (2010) for the categories, Fintzi et al. (2017) for criticism of reduction (p2-p3) and Bu et al. (2020) for an advanced example. Vono (2020) says about DA: introducing appropriate auxiliary variables while preserving the initial target probability distribution and offering a computationally efficient inference cannot be conducted in a systematic way.
Levels of simulator access
Think: how much do we know about a simulator? i.e. mathematical model, code accessible, run-time accessible, gradients and other quantities...
Your simulator access
Analytic likelihood (w. gradients maybe)
Analytic conditional likelihood $\color{orange}{p(x|z,\theta)}$
import sim
(source code)
x = GET(θ)
(just samples)
Inference strategy
Variational inference, MCMC (use gradients)
Data augmentation, martingales, augmented LFI
Source-level AD (LLVM, Julia) $+$ augmented LFI
Probabilistic programming (inference compilation)
▶️Bare likelihood-free inference◀️ our workshop
See Cranmer et al. 2020 about LFI augmentation strategies for improved sampling efficiency.
LFI "gold" (p5): I. $\color{orange}{p(x|z,\theta)}$, II. $\nabla_\theta \log p(x,z|\theta)$, III. $\nabla_z \log p(x,z|\theta)$, IV. $r(x,z|\theta,\theta'):=p(x,z|\theta)/p(x,z|\theta')$, V. $\nabla_\theta (x,z)$, VI. $\nabla_z\ x$. Use AD to get derivatives.
Simulation-based inference
LFI, ABC, SBI, FBI, CIA, WHO?
LFI: Likelihood-free inference. Any technique not requiring likelihoods.
ABC: Approximate Bayesian Computation. Population genetics (Tavaré et al. 1997); use of sampling, implicit posteriors.
SBI: Simulation-based inference. 💡 Non-ABC LFI. Modern LFI, often using NNs.
Let's have a hands-on look in a simplified case.
Neural networks bring evaluatable densities, flexible parameterization and automatic feature engineering using architectural bias adapted to data equivariances.
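As a first contact with ABC, here is a minimal sketch of rejection ABC. The Gaussian toy simulator, the uniform prior on $[-5, 5]$, the sample mean as summary statistic and the tolerance `eps = 0.1` are all assumptions chosen for illustration; this is not the code of the workshop notebook.

```python
import random
import statistics

def simulator(theta, rng, n_obs=10):
    """Toy simulator: n_obs noisy measurements scattered around theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n_obs)]

def summary(x):
    """Summary statistic: reduce the raw output to its sample mean."""
    return statistics.fmean(x)

def rejection_abc(s_obs, n_sims, eps, rng):
    """Keep prior draws whose simulated summary lands within eps of the observed one."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)                        # draw from the prior
        if abs(summary(simulator(theta, rng)) - s_obs) < eps:
            accepted.append(theta)                            # approximate posterior sample
    return accepted

rng = random.Random(0)
s_obs = 1.0        # assumed observed summary statistic
n_sims = 50000
accepted = rejection_abc(s_obs, n_sims, eps=0.1, rng=rng)
acceptance_rate = len(accepted) / n_sims
print(f"accepted {len(accepted)} of {n_sims} simulations ({acceptance_rate:.1%})")
print(f"posterior mean ~ {statistics.fmean(accepted):.2f}, std ~ {statistics.stdev(accepted):.2f}")
```

Most simulations are thrown away, and shrinking `eps` for a better approximation lowers the acceptance rate further. The result is also tied to one `s_obs`: a new observation requires a new run.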
Interlude: from ABC to SBI
Open notebook for a first practical contact with ABC/SBI.
Conditional density estimation
Amortized SBI is density estimation
Amortization is sharing parameters across models for different predictions.
Conditional density estimation essentially provides us with amortized SBI.
With flexible neural networks we can estimate
- the likelihood (emulation). Prior-independent, but still needs MCMC or VI to handle the evidence.
- the posterior directly.
How to reduce distance: e.g. KL divergence - guide programs (mode seeking) vs. inference networks (mode covering)?
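The idea that amortized posterior estimation is conditional density estimation can be sketched without a neural network. Below, a Gaussian with a linear mean, $q_\phi(\theta|x) = \mathcal{N}(a x + b, s^2)$, stands in for the network. The linear-Gaussian toy simulator, the standard normal prior and the noise level are assumptions, chosen so that the exact posterior is known.

```python
import math
import random

rng = random.Random(0)
noise_std = 0.5  # assumed simulator noise

def simulator(theta):
    """Toy simulator: x = theta + Gaussian noise."""
    return theta + rng.gauss(0.0, noise_std)

# 1. Simulate (theta, x) pairs from the prior theta ~ N(0, 1) and the simulator.
thetas = [rng.gauss(0.0, 1.0) for _ in range(20000)]
xs = [simulator(t) for t in thetas]

# 2. Fit q(theta | x) = N(a*x + b, s2) by maximum likelihood on the pairs
#    (for this family, maximum likelihood reduces to least squares).
n = len(xs)
mean_x, mean_t = sum(xs) / n, sum(thetas) / n
a = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, thetas)) / sum((x - mean_x) ** 2 for x in xs)
b = mean_t - a * mean_x
s2 = sum((t - (a * x + b)) ** 2 for x, t in zip(xs, thetas)) / n

def posterior(x_o):
    """Amortized: mean and std of q(theta | x_o) for any x_o, with no new simulations."""
    return a * x_o + b, math.sqrt(s2)

for x_o in (0.0, 2.0):
    mean, std = posterior(x_o)
    print(f"x_o = {x_o}: q(theta | x_o) = N({mean:.2f}, {std:.2f}^2)")
```

For this toy problem the exact posterior is $\mathcal{N}\big(x_{\rm o}/(1+\sigma^2),\ \sigma^2/(1+\sigma^2)\big)$ with $\sigma = 0.5$, so the fit can be checked against $a = 0.8$ and $s^2 = 0.2$. A neural density estimator plays the same role when the posterior is not Gaussian.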
Advantages derived from using networks
- Feature learning, can incorporate inductive biases
- Scales well with more data
- Interpolation properties
- Differentiable
Further materials
- Slides for the next sessions are available on GitHub
Resources
- Review. Cranmer et al. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48), 30055–30062.
- Software. Tejero-Cantero et al., (2020). sbi: A toolkit for simulation-based inference. Journal of Open Source Software, 5(52), 2505 — 🧑🏽💻 https://github.com/mackelab/sbi
- Application (neuroscience). Gonçalves et al. (2020). Training deep neural density estimators to identify mechanistic models of neural dynamics. ELife, 9, e56261.