🎚️

Introduction to sbi: simulation & inference

About this document

This text corresponds to the slides of the Module 1 of the Simulation-based inference workshop held in 2021. Head on to the workshop page for the rest of the content.

🄯 Álvaro Tejero-Cantero for all the text, licensed under a CC-BY-SA license.

ℹ️ Practical parts are indicated in red background. Speaker's notes are under ▸ §.

‣

Module's learning goals

‣

Table of contents

Simulators for science

Models in science

making models is part of the scientific method

models capture only some aspects of reality
when formalized, they enable quantitative, testable hypotheses

model functionalities

prediction — to support decisions
understanding — to select interventions

the structure that doesn't change is the model

the malleable part are parameters
parameters are 'tuned' based on observations

multiple input parameter sets can lead to the same output prediction

equifinality, degeneracy are key to resilience, homeostasis of complex systems

Simple pendulum

\frac{{\rm d}^2 \theta}{{\rm d} t^2} + \frac{g}{\ell}\, \sin\theta = 0

Can predict angles $\theta(t)$ given $g/\ell$ and $\theta(0)$ .
Can infer $g/\ell$ from measured $\theta(t)$ .

for small amplitudes $\sin\theta\simeq\theta$ and timing one oscillation $T=2\pi \sqrt{\ell/g}$ approximately suffices to infer $g/\ell$ → $T$ summarises $\theta(t)$ for inference.

And extract understading wrt. interventions and counterfactuals.

‣

Simulators, everywhere

20th-century: explosion of digital, expressive simulators → "complex science"

"simulator as numerical solver" for an explicit model (e.g. PDEs) — based on discretization
"simulator as defined by code", an implicit model built from individual interaction rules
and anything in-between, e.g. pulse-coupled NNs: discrete + continuous dynamics

‣

Simulators galore

But what are simulators?

simulate - /ˈsɪm·jəˌleɪt/ (verb). (Cambridge English dictionary)

1. create conditions or processes similar to something that exists.

to produce a situation or event that seems real but is not real, especially in order to help people learn how to deal with such situations or events

simulation might be a key ingredient of cognitive processing? cf. predictive coding. (imagination, speculation, platonic shadow)
typology: continuous vs. discrete (regular vs. event-based), dynamic vs. steady-state, deterministic vs. stochastic...
Let's look at a couple of examples

‣

Agent-based models

Microscale models "defined by code". Agents represented directly, not by density or concentration; possess internal state, interaction rules, and learning processes that determine state updates; they live in a topology embedded in an environment.

SIR ABM running (w. social distancing, isolated=0.8).

Parameters of SIR model in Agents.jl.

Other examples: culture modelling, tumor growth, epidemics, ecology, traffic, percolation (oil through soil), voting...

‣

Differential-equation-based models

continuous models for dynamically changing phenomena
relate system response to infinitesimal changes of state variable(s)
classes of Diferential equation (DE):

ODE: ordinary DE: one variable $y$ , $F\left (x,y,y',\ldots, y^{(n-1)} \right )=y^{(n)}$ PDE: partial DE: multiple variables, e.g. time and space SDE: stochastic: DE with stochastic fluctuations (→ stochastic process)

Solution via integration is rarely possible in closed form

→ numerical approximations: discretization.

Simulator example: core-collapse supernova models

Goal. Test understanding of physics. What are the mechanisms that make SN explode?

Model. Euler eqs. + eq. of state + conservation of mass (1), momentum (2), energy (3)

\small{\begin{align} \partial_t \rho + \nabla (\rho \bm{v}) &= 0 \\ \partial_t (\rho \bm{v}) + (\nabla^\intercal (\rho \bm{v} \bm{v}^\intercal))^\intercal + \nabla p &= -\rho\, \nabla \Phi\\ \partial_t (\rho E) + \nabla^\intercal ((\rho E + p) \bm{v}) &= -\rho \bm{v}^\intercal \nabla \Phi + Q_{\nu} + Q_{\rm nuc} \end{align}}

Observables. lightcurves, neutrinos, EM spectrum (🤯 SN remnants visible for centuries, but relevant dynamics on $\lesssim 1\,{\rm s}$ )

Solution.

PDEs → Finite Elements Model (FEM)
"parameter-free"
months on high-performance computers

Supernova simulation 400ms post-bounce, courtesy Thomas Janka →

‣

Epistemic ambition of simulators

🖍️ Theory building — focus on process

Formalizing heuristics for understanding, running simulator to generate hypotheses.
Use of reduced models to understand dynamical degrees of freedom and interactions.

🖥️ Hypothesis testing — focus on result

Expected to quantitatively reproduce empirical measurements. Our workshop.

Problem: model misspecification

"Explanation requires reduction", all models are misspecified.

Parsimony (modelling "from the null up", Premo, 2007) vs. ommitted variable bias.

‣

Interlude: introduce your simulator

Share your simulator-based research with the group. You can re-use slides you have. Paste them here in this Google presentation.

Key features of your use case (use up to 2 minutes):

why: phenomenon and scientific meaning of parameters and observations. Do you expect degeneracies / equifinality? what would they mean?
what: type, dimensionality and structure of parameters and observations, sources of noise
how: type of simulator (ODE, ABM...), programming language
when: what is the approximate runtime?

questions from the audience (about 1m).

Simulators vs. statistical models

Models, and models

Simulations are models (often stochastic), but what about 'statistical models'. Differences?

Models can either be mechanistic or convenient (Ripley 1987, Stochastic Simulation p3).

Statistical models

"Accommodate and fit data"

Statistical relationship, correlation

Interpretable post-hoc, if at all

Poor performance outside of training set

Less opinionated — needs much data

Built for inference ('fit')

Mechanistic models (many simulators)

"Formalize principles and hypotheses".

I→O mechanistic link, causal if time involved

Interpretable by design

Extrapolate well (if validated)

More opinionated — works with little data

Inference ❓

‣

Example: modelling counts vs. modelling processes

RNN model of infection counts

Designed to fit data and predict

RNN flexible approximator,
specific architecture to fit time correlations
many, $\mathcal{O}(10^4)$ , opaque parameters
model hard to debug and interpret
data-hungry,
poor extrapolation

Recurrent Neural Network (P. Roelants)

Epidemic compartmental model

Designed for insight

ODE, limited modelling repertoire
use rate constants for population kinetics
few, $\mathcal{O}(1)$ interpretable parameters
expresses interaction rules, conservation laws, smoothness constraints.
misspecification risk, but know how to formulate hypotheses as compartments

SEIR compartmental model (Alex Gessner).

‣

Induction vs. deduction, an age-old dichotomy

Data reduction vs. theory making. Philosophy of science as a guide.

Induction $\sim$ statistical models: from experimental data, infer the simplest (usually statistical) laws that account best for the data.
Deduction $\sim$ mechanistic models: from first principles, obtain laws that produce predictions in specific settings. Experimental design seeks measurements that falsify the laws.

‣

Inference on simulators

An inverse problem

from parameters to measurements: modelization, simulation: forward problem
from measurements to parameters: parameter estimation: inverse problem
point estimates via optimization

grid search (not scalable)
derivative-based optimization (gradients generally not available)
population methods: evolutionary strategies

share: no principled ranking of point estimates

Multiple and uncertain parameter sets

functions destroy information, when they are noninjective (underdetermination, equifinality)
$\hat{\theta}$ replicates observation $x_{\rm o}$ , i.e. $x_{\rm o} = \operatorname{sim} \hat{\theta}$ . But what about $\operatorname{sim}(\hat{\theta} + \epsilon)$

this could be a serious problem
sensitivity analysis is the discipline dealing with the $\operatorname{sim} (\hat{\theta}+\epsilon)$ problem

solution? estimate $\hat{\theta}(x)$ , calculate confidence sets

uncertainty estimate ✔️
find the multiple $\hat{\theta}$ which could have generated $x_{\rm o}$ (and rank them) ❌

‣

Bayesian parameter inference

uncertainty: getting $\theta$ might be only one step in an inference or decision pipeline.

degeneracy: ignoring alternate $\theta$ leading to the same $x_{\rm o}$ discards alternative explanations upfront.

A full posterior parameter distribution conditioned on the observation, $p(\theta|x_{\rm o})$ provides both.

But how to get there?

simulator data + observation + inference 🪄 ( $\sim$ day 2 of the workshop)→ posterior

And what to do with them?

posterior + validation / interpretation ( $\sim$ day 3 of the workshop) 🪄 → hypotheses, results

Introducing the posterior

For $\theta \in \mathbb{R}^{D>2}$ we need strategies for representation. Ground truth $\orange{\bullet}\equiv\{\orange{\theta_1, \theta_2}\}$

marginal posteriors in 2D 🔨 integrate out $D-2$ parameter dimensions

$p(\theta_1, \theta_2|x_{\rm o})=\green{\int} p(\theta_1,\theta_2|x_{\rm o}, \green{\theta_{3:31}})\ \green{{\rm d}\theta_{3:31}}$

typically more like a blob!

Posterior plots courtesy Michael Deistler →

conditioned posteriors in two dimensions

🏖️ fix $D - 2$ parameter dimensions

$p(\theta_1,\theta_2|x_{\rm o}, \color{green}{\theta_{3:31}\leftarrow\theta^\ast})$

typically sharper!

Interlude: Bayesian inference

The Bayesian workflow

Assume we want to know $\theta$ and have measured data on $x$ .

Modeling (day 1): formulate joint pdf $p(x,\theta)$

product rule: $p(x,\theta) = p(x|\theta) p(\theta)=p(\theta|x)p(x)$ .
prior $p(\theta)$ summarizes our knowledge of the parameters (e.g. 'must be positive')
likelihood $p(x|\theta)$ embodies our knowledge of how $x$ is generated from $\theta$ (→ model)
rinse and repeat (update the prior: the posterior is the new prior)

Inference (day 2): Use the data $x_{1:N}$ to learn about the target variable $\theta$ by conditioning

Typically, use $x_{1:N}$ to fix $\phi$ in $p_\phi(\theta|x)$
Multiple computational approaches, depending on the problem

Validation and interpretation (day 3)

check internal consistency of the posterior (SBC, posterior-predictive checks)
check consistency with domain knowledge

See: Bayesian Worklflow, Gelman et al. 2020 (https://arxiv.org/abs/2011.01808).

Bayesian inference

\color{0B6E99}{p(\theta|x)} =\frac{ \color{E03E3E}{p (x|\theta)}\color{6940A5}{p(\theta)}}{\color{9B9A97}{\int_\theta p(x|\theta)\ p(\theta)}}

The Iikelihood (model) times the prior divided by the evidence yields the posterior.

For very few models this equation yields a posterior in closed form.

a closed-form posterior can be sampled from and evaluated. Ideal case!

Usually the evidence integral is hard to compute

Markov-Chain Monte-Carlo (stochastic, asymptotically exact), works with unnormalized joint. Implicit $\color{0B6E99}{p(\theta|x)}$ : low bias / high-variance (slow). Only samples!
Variational inference (approximate, deterministic). Assume explicit parametric posterior, fast.

Simulators as statistical models

Stochasticity and simulators

Typically stochasticity models all the processes where we lack specific mechanistic hypotheses

unless the process is itself stochastic at the physical level, e.g. $\beta$ decay in a nucleus.

What are sources of stochasticity?

unobserved latent variables

stochastic program paths

instrument noise (aleatoric)
numerical approximations (→ probabilistic numerics)

→ Outputs must then be stochastic themselves - random variables → probabilities!

Since simulators have a stochastic component, could we treat simulators as statistical models?

‣

Simulator as generative models

\color{0B6E99}{p(\theta|x)} =\frac{ \color{E03E3E}{p (x|\theta)}\color{6940A5}{p(\theta)}}{\color{9B9A97}{\int_\theta p(x|\theta)\ p(\theta)}}

$\color{E03E3E}p(x|\theta)$ usually not evaluatable for simulators, as not built for inference.
generative models build a probability density over samples — often an implicit one! (e.g. GANs)
simulators are generative models with implicit likelihood. Most general scenario for inference - inference just from samples - likelihood-free inference (LFI).
let's look at simulators from a probabilistic perspective

‣

Anatomy of a simulator: notation

Probabilistic programs are programs with stochasticity that are interpreted as statistical models. In this sense, simulators are probabilistic programs, computer programs that

take parameter vector as input, $\theta$ — we treat $\theta$ as a random variable, and assign it prior $p(\theta)$
compute internal states — latent variables $z_\ell \sim p_\ell(z_\ell|\theta,z_{<\ell})$
produce result vectors $x$ comparable with experimental observations $x_{\rm o}\sim p(x|\theta_{\rm true})$

Since simulators have often a stochastic component and often no explicit functional form, we can borrow 'sampling' notation, $x \sim p(x|\theta,z)$ .

Some notes about the parameters $\theta$ → latents $z_{1:L}$ → outputs $x$

$\theta$ fixed dimensionality, typically no structure
$x$ can be structured (e.g. images, graphs...), and high-dimensional
$z$ correspond to meaningful states, but are typically unobservable.

continuous or discrete, changing dimensionality (even during simulation)
updated deterministically, or stochastically
we will NOT infer $z$ here

but the existence of this unobserved state is problematic...

‣

Latent-path intractability

Let's talk about $\color{E03E3E}{p(x|\theta)}$ . How can a simulator become intractable, beyond lack of normalization?

\begin{align*}p(x|\theta) &= \int \color{orange}{p(x,z|\theta)}\, \mathrm{d}z \quad \color{grey}\textsf{(sum over execution traces)}\\ \color{orange}{p(x,z|\theta)} &= p(x|\theta,z)\prod_{1:L} p_\ell(z_\ell|\theta, z_{<\ell})\quad \color{grey}{\textsf{(if sequential gen.)}} \end{align*}

We seek likelihood-free techniques to free us from latent-path intractability; ideally they'd also solve the normalization problem.

Partial observability in the field

Another name is partial observability. Here's a case from epidemiology (Kulkarni et al. 2021, discrete compartmental model)

Three observed, three unobserved populations (arXiv:2106.15508 [cs.DC]).

Ways to make likelihoods tractable

model reduction / coarsening. different model / limits expressivity, model-specific ❌
likelihood data augmentation. model-specific, iterative ❌

(...) introduce additional parameters $\psi$ , which represent missing data, in such a way that the likelihood $p(x,\psi|\theta)$ is tractable. Inference then proceeds by estimating both $\theta$ and $\psi$ , typically via EM or MCMC. (after O'Neill 2010)

martingales. elegant ✔️ few specific models, hard ❌
don't? likelihood-free inference: generally applicable, can target posterior✔️ performance❓, cutting edge❓

‣

Levels of simulator access

Think: how much do we know about a simulator? i.e. mathematical model, code accessible, run-time accessible, gradients and other quantities...

Your simulator access

Analytic Iikelihood (w. gradients maybe)

Analytic conditional likelihood $\color{orange}{p(x|z,\theta)}$

import sim (source code)

x = GET(θ) (just samples)

Inference strategy

Variational inference, MCMC (use gradients)

Data augmentation, martingales, augmented LFI

Source-level AD (LLVM, Julia) $+$ augmented LFI

Probabilistic programming (inference compilation)

▶️Bare likelihood-free inference◀️ our workshop

See Cranmer et al. 2020 about LFI augmentation strategies for improved sampling efficiency.

See PyProb for inference compilation, Madminer for augmented LFI in HEP.

‣

Simulation-based inference

LFI, ABC, SBI, FBI, CIA, WHO?

LFI — Likelihood-free inference. Any technique not requiring likelihoods

ABC Approximate Bayesian Computation — population genetics (Tavaré et al. 1997); use of sampling, implicit posteriors.

SBI ~~Sample~~Simulation-based inference — 💡 non ABC LFI. Modern LFI, often using NNs.

Let's have a hands-on look in a simplified case.

‣

Interlude: from ABC to SBI

Open notebook for a first practical contact with ABC/SBI.

Conditional density estimation

Amortized SBI is density estimation

Amortization is sharing parameters across models for different predictions.

Conditional density estimation essentially provides us with amortized sbi.

With flexible neural networks we can estimate

the likelihood (emulation). Prior-independent, but need still MCMC, VI for the evidence.
the posterior directly.

‣

Advantages derived from using networks

Feature learning, can incorporate inductive biases
Scales well with more data
Interpolation properties
Differentiable

Further materials

Slides for the next sessions are available on GitHub

https://github.com/mlcolab/sbi-workshop

Resources

Review. Cranmer et al. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48), 30055–30062.
Software. Tejero-Cantero et al., (2020). sbi: A toolkit for simulation-based inference. Journal of Open Source Software, 5(52), 2505 — 🧑🏽‍💻 https://github.com/mackelab/sbi
Application (neuroscience). Gonçalves et al. (2020). Training deep neural density estimators to identify mechanistic models of neural dynamics. ELife, 9, e56261.