
Introduction to sbi: simulation & inference

About this document

This text corresponds to the slides of Module 1 of the simulation-based inference workshop held in 2021. Head to the workshop page for the rest of the content.

🄯 Álvaro Tejero-Cantero for all the text, licensed under a CC-BY-SA license.

image

ℹ️ Practical parts are indicated with a red background.

Module's learning goals

Table of contents

Simulators for science

Models in science

  • making models is part of the scientific method
    • models capture only some aspects of reality
    • when formalized, they enable quantitative, testable hypotheses
  • model functionalities
    • prediction — to support decisions
    • understanding — to select interventions
  • the structure that doesn't change is the model
    • the malleable part are parameters
    • parameters are 'tuned' based on observations
  • multiple input parameter sets can lead to the same output prediction
    • equifinality, degeneracy are key to resilience, homeostasis of complex systems

Simple pendulum

$$\frac{{\rm d}^2 \theta}{{\rm d} t^2} + \frac{g}{\ell}\, \sin\theta = 0$$

  • Can predict angles $\theta(t)$ given $g/\ell$ and $\theta(0)$.
  • Can infer $g/\ell$ from measured $\theta(t)$.
    • for small amplitudes $\sin\theta \simeq \theta$, and timing one oscillation $T = 2\pi\sqrt{\ell/g}$ approximately suffices to infer $g/\ell$ → $T$ summarises $\theta(t)$ for inference.
  • And extract understanding with respect to interventions and counterfactuals.
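The forward/inverse pair above can be sketched in a few lines of Python (an illustrative sketch, not part of the workshop materials): integrate the pendulum ODE numerically, then recover $g/\ell$ from the timed period alone.

```python
import math

def simulate_pendulum(g_over_l, theta0, dt=1e-3, t_max=20.0):
    """Forward problem: integrate theta'' = -(g/l) sin(theta) with RK4."""
    def f(th, om):
        return om, -g_over_l * math.sin(th)
    theta, omega, t = theta0, 0.0, 0.0
    ts, thetas = [t], [theta]
    for _ in range(int(t_max / dt)):
        k1 = f(theta, omega)
        k2 = f(theta + 0.5 * dt * k1[0], omega + 0.5 * dt * k1[1])
        k3 = f(theta + 0.5 * dt * k2[0], omega + 0.5 * dt * k2[1])
        k4 = f(theta + dt * k3[0], omega + dt * k3[1])
        theta += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        omega += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        t += dt
        ts.append(t)
        thetas.append(theta)
    return ts, thetas

def infer_g_over_l(ts, thetas):
    """Inverse problem: summarise theta(t) by its period T, then g/l = (2*pi/T)**2."""
    crossings = [ts[i] for i in range(len(thetas) - 1)
                 if thetas[i] > 0 >= thetas[i + 1]]  # downward zero crossings
    T = (crossings[-1] - crossings[0]) / (len(crossings) - 1)  # mean period
    return (2 * math.pi / T) ** 2

ts, thetas = simulate_pendulum(g_over_l=9.81, theta0=0.1)  # small amplitude
g_over_l_hat = infer_g_over_l(ts, thetas)                  # close to 9.81
```

Note that the summary statistic $T$ discards most of $\theta(t)$ yet suffices for inference at small amplitudes.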

Simulators, everywhere

Cranmer et al. 2020:
  • 20th century: explosion of digital, expressive simulators → "complex science"
    1. "simulator as numerical solver" for an explicit model (e.g. PDEs) — based on discretization
    2. "simulator as defined by code", an implicit model built from individual interaction rules
    3. and anything in between, e.g. pulse-coupled NNs: discrete + continuous dynamics

Simulators galore

But what are simulators?

  • simulate - /ˈsɪm·jəˌleɪt/ (verb). (Cambridge English Dictionary)
    1. to create conditions or processes similar to something that exists
    2. to produce a situation or event that seems real but is not real, especially in order to help people learn how to deal with such situations or events
  • simulation might be a key ingredient of cognitive processing, cf. predictive coding (imagination, speculation, Plato's cave shadows)
  • typology: continuous vs. discrete (regular vs. event-based), dynamic vs. steady-state, deterministic vs. stochastic...
  • Let's look at a couple of examples

Agent-based models

Microscale models "defined by code". Agents represented directly, not by density or concentration; possess internal state, interaction rules, and learning processes that determine state updates; they live in a topology embedded in an environment.

SIR ABM running (with social distancing, isolated = 0.8).
Parameters of the SIR model in Agents.jl.

Other examples: culture modelling, tumor growth, epidemics, ecology, traffic, percolation (oil through soil), voting...
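To make the ingredients above concrete, here is a hypothetical minimal SIR agent-based model in Python (not Agents.jl; all parameter values are illustrative): agents carry an internal state, meet random others, and update state by interaction and recovery rules.

```python
import random

def sir_abm(n_agents=500, beta=0.3, gamma=0.1, contacts=5, steps=100, seed=0):
    """Minimal SIR agent-based model with random mixing."""
    rng = random.Random(seed)
    state = ["S"] * n_agents
    for i in rng.sample(range(n_agents), 5):   # seed 5 infections
        state[i] = "I"
    history = []
    for _ in range(steps):
        new_state = state[:]
        for i, s in enumerate(state):
            if s == "I":
                # each infected agent meets a few random others (the topology)
                for j in rng.sample(range(n_agents), contacts):
                    if state[j] == "S" and rng.random() < beta:
                        new_state[j] = "I"   # interaction rule
                if rng.random() < gamma:
                    new_state[i] = "R"       # internal state update
        state = new_state
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history

history = sir_abm()
final_S, final_I, final_R = history[-1]
```

Agents are represented directly, not as densities: a density-based description only emerges when the counts are aggregated, as in `history`.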


Differential-equation-based models

  • continuous models for dynamically changing phenomena
  • relate system response to infinitesimal changes of state variable(s)
  • classes of differential equation (DE):
    • ODE — ordinary DE: one variable $y$, $F\left(x, y, y', \ldots, y^{(n-1)}\right) = y^{(n)}$
    • PDE — partial DE: multiple variables, e.g. time and space
    • SDE — stochastic DE: DE with stochastic fluctuations (→ stochastic process)

  • Solution via integration is rarely possible in closed form
  • → numerical approximations: discretization.
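The simplest such discretization, explicit Euler, illustrates the idea (a generic sketch, not tied to any particular simulator): approximate the solution by following the tangent for many small steps.

```python
import math

def euler(f, y0, t0, t1, n):
    """Explicit Euler: the simplest discretization of dy/dt = f(t, y)."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y += h * f(t, y)   # follow the tangent for one small step h
        t += h
    return y

# exponential decay dy/dt = -y with y(0) = 1; exact solution y(1) = exp(-1)
approx = euler(lambda t, y: -y, 1.0, 0.0, 1.0, 1000)
exact = math.exp(-1.0)
```

The global error shrinks linearly with the step size $h$; production solvers use higher-order and adaptive schemes, but the discretization principle is the same.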

Simulator example: core-collapse supernova models

Goal. Test understanding of physics. What are the mechanisms that make SN explode?

Model. Euler eqs. + eq. of state + conservation of mass (1), momentum (2), energy (3)

$$\begin{align} \partial_t \rho + \nabla (\rho \bm{v}) &= 0 \\ \partial_t (\rho \bm{v}) + (\nabla^\intercal (\rho \bm{v} \bm{v}^\intercal))^\intercal + \nabla p &= -\rho\, \nabla \Phi \\ \partial_t (\rho E) + \nabla^\intercal ((\rho E + p) \bm{v}) &= -\rho \bm{v}^\intercal \nabla \Phi + Q_{\nu} + Q_{\rm nuc} \end{align}$$

Observables. Lightcurves, neutrinos, EM spectrum (🤯 SN remnants visible for centuries, but relevant dynamics on $\lesssim 1\,{\rm s}$)

Solution.

  • PDEs → Finite Elements Model (FEM)
  • "parameter-free"
  • months on high-performance computers

Supernova simulation 400ms post-bounce, courtesy Thomas Janka →

image

Epistemic ambition of simulators

🖍️ Theory building — focus on process

  1. Formalizing heuristics for understanding; running the simulator to generate hypotheses.
  2. Use of reduced models to understand dynamical degrees of freedom and interactions.

🖥️ Hypothesis testing — focus on result

Expected to quantitatively reproduce empirical measurements. Our workshop.

Problem: model misspecification

"Explanation requires reduction", all models are misspecified.

Parsimony (modelling "from the null up", Premo, 2007) vs. ommitted variable bias.


Interlude: introduce your simulator

Share your simulator-based research with the group. You can re-use slides you have. Paste them here in this Google presentation.

  • Key features of your use case (use up to 2 minutes):
    • why: phenomenon and scientific meaning of parameters and observations. Do you expect degeneracies / equifinality? what would they mean?
    • what: type, dimensionality and structure of parameters and observations, sources of noise
    • how: type of simulator (ODE, ABM...), programming language
    • when: what is the approximate runtime?
  • questions from the audience (about 1 minute).

Simulators vs. statistical models

Models, and models

Simulations are models (often stochastic), but how do they differ from 'statistical models'?

Models can either be mechanistic or convenient (Ripley 1987, Stochastic Simulation p3).

| Statistical models | Mechanistic models (many simulators) |
| --- | --- |
| "Accommodate and fit data" | "Formalize principles and hypotheses" |
| Statistical relationship, correlation | I→O mechanistic link, causal if time involved |
| Interpretable post hoc, if at all | Interpretable by design |
| Poor performance outside of training set | Extrapolate well (if validated) |
| Less opinionated — needs much data | More opinionated — works with little data |
| Built for inference ('fit') | Inference ❓ |


Example: modelling counts vs. modelling processes

RNN model of infection counts

Designed to fit data and predict

  • RNN: flexible approximator
  • specific architecture to fit time correlations
  • many, $\mathcal{O}(10^4)$, opaque parameters
  • model hard to debug and interpret
  • data-hungry
  • poor extrapolation
Recurrent neural network (P. Roelants).

Epidemic compartmental model

Designed for insight

  • ODE, limited modelling repertoire
  • uses rate constants for population kinetics
  • few, $\mathcal{O}(1)$, interpretable parameters
  • expresses interaction rules, conservation laws, smoothness constraints
  • misspecification risk, but we know how to formulate hypotheses as compartments
SEIR compartmental model (Alex Gessner).
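The contrast can be made concrete with a hypothetical minimal SEIR implementation (plain Euler stepping; parameter values are illustrative, not fitted): four rate equations, three interpretable parameters.

```python
def seir(beta, sigma, gamma, S0=0.99, E0=0.01, days=160, dt=0.1):
    """SEIR compartmental model, integrated with explicit Euler steps."""
    S, E, I, R = S0, E0, 0.0, 0.0
    traj = []
    for _ in range(int(days / dt)):
        new_inf = beta * S * I      # rate of S -> E (mass-action kinetics)
        dS = -new_inf
        dE = new_inf - sigma * E    # sigma: rate of leaving the latent stage
        dI = sigma * E - gamma * I  # gamma: recovery rate
        dR = gamma * I
        S += dt * dS; E += dt * dE; I += dt * dI; R += dt * dR
        traj.append((S, E, I, R))
    return traj

traj = seir(beta=0.5, sigma=0.2, gamma=0.1)  # three interpretable parameters
final = traj[-1]
```

The conservation law $S + E + I + R = 1$ is built into the equations, and each parameter maps to a falsifiable hypothesis about the process.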

Induction vs. deduction, an age-old dichotomy

Data reduction vs. theory making. Philosophy of science as a guide.

  • Induction $\sim$ statistical models: from experimental data, infer the simplest (usually statistical) laws that account best for the data.
  • Deduction $\sim$ mechanistic models: from first principles, obtain laws that produce predictions in specific settings. Experimental design seeks measurements that falsify the laws.

Inference on simulators

An inverse problem

  • from parameters to measurements (modelling, simulation): the forward problem
  • from measurements to parameters (parameter estimation): the inverse problem
  • point estimates via optimization
    • grid search (not scalable)
    • derivative-based optimization (gradients generally not available)
    • population methods: evolutionary strategies
  • shared shortcoming: no principled ranking of point estimates

Multiple and uncertain parameter sets

  • functions destroy information when they are noninjective (underdetermination, equifinality)
  • $\hat\theta$ replicates observation $x_{\rm o}$, i.e. $x_{\rm o} = \operatorname{sim}\hat\theta$. But what about $\operatorname{sim}(\hat\theta + \epsilon)$?
    • this could be a serious problem
    • sensitivity analysis is the discipline dealing with the $\operatorname{sim}(\hat\theta + \epsilon)$ problem
  • solution? estimate $\hat\theta(x)$, calculate confidence sets
    • uncertainty estimate ✔️
    • find the multiple $\hat\theta$ which could have generated $x_{\rm o}$ (and rank them) ❌
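A toy noninjective simulator (hypothetical, purely for illustration) shows why a single point estimate silently discards an equally valid explanation:

```python
def simulator(theta):
    """Noninjective toy simulator: theta and -theta yield the same output."""
    return theta ** 2

x_o = 4.0  # observation
grid = [i / 100.0 for i in range(-300, 301)]

# a point estimate via grid search returns one 'best' theta...
theta_hat = min(grid, key=lambda th: (simulator(th) - x_o) ** 2)

# ...while the set of parameters that reproduce x_o reveals the degeneracy
compatible = [th for th in grid if abs(simulator(th) - x_o) < 0.05]
```

Both $\theta = 2$ and $\theta = -2$ reproduce $x_{\rm o}$ exactly; the point estimate keeps only one of them and offers no ranking of the rest.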

Bayesian parameter inference

uncertainty: getting $\theta$ might be only one step in an inference or decision pipeline.

degeneracy: ignoring alternate $\theta$ leading to the same $x_{\rm o}$ discards alternative explanations upfront.

A full posterior parameter distribution conditioned on the observation, $p(\theta|x_{\rm o})$, provides both.

  • But how to get there?
    • simulator data + observation + inference 🪄 ($\sim$ day 2 of the workshop) → posterior
  • And what to do with the posterior?
    • posterior + validation / interpretation 🪄 ($\sim$ day 3 of the workshop) → hypotheses, results

Introducing the posterior

For $\theta \in \mathbb{R}^{D>2}$ we need strategies for representation. Ground truth $\bullet \equiv \{\theta_1, \theta_2\}$ (orange dot in the plots).

  1. marginal posteriors in 2D 🔨 integrate out the other $D-2$ parameter dimensions:

    $p(\theta_1, \theta_2|x_{\rm o}) = \int p(\theta_1, \theta_2, \theta_{3:31}|x_{\rm o})\ {\rm d}\theta_{3:31}$

    typically more like a blob!

    Posterior plots courtesy Michael Deistler →

image

  2. conditioned posteriors in 2D 🏖️ fix the other $D-2$ parameter dimensions:

    $p(\theta_1, \theta_2|x_{\rm o}, \theta_{3:31} \leftarrow \theta^\ast)$

    typically sharper!

image
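Given samples from the posterior, both representations are mechanical: marginalizing means dropping columns, conditioning means keeping samples near fixed values of the other dimensions. A hypothetical 4-parameter toy (with $\theta_{3:4}$ standing in for the higher dimensions):

```python
import random

rng = random.Random(0)
# stand-in posterior samples over 4 parameters (in practice these would come
# from an inference method, not be drawn directly from known Gaussians)
samples = [(rng.gauss(0, 1), rng.gauss(2, 0.5), rng.gauss(-1, 1), rng.gauss(0, 2))
           for _ in range(20000)]

# marginal posterior over (theta_1, theta_2): drop the other columns
marginal = [(t1, t2) for t1, t2, _, _ in samples]

# conditional posterior: keep only samples with theta_3, theta_4 near fixed values
cond = [(t1, t2) for t1, t2, t3, t4 in samples
        if abs(t3 - (-1.0)) < 0.1 and abs(t4 - 0.0) < 0.2]
```

Conditioning discards most samples (the price of fixing dimensions from samples alone), which is why conditional slices are usually computed from a density rather than by filtering.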

Interlude: Bayesian inference

The Bayesian workflow

Assume we want to know $\theta$ and have measured data on $x$.

  1. Modeling (day 1): formulate the joint pdf $p(x, \theta)$
    • product rule: $p(x, \theta) = p(x|\theta)\,p(\theta) = p(\theta|x)\,p(x)$
    • prior $p(\theta)$ summarizes our knowledge of the parameters (e.g. 'must be positive')
    • likelihood $p(x|\theta)$ embodies our knowledge of how $x$ is generated from $\theta$ (→ model)
    • rinse and repeat (update the prior: the posterior is the new prior)
  2. Inference (day 2): use the data $x_{1:N}$ to learn about the target variable $\theta$ by conditioning
    • typically, use $x_{1:N}$ to fix $\phi$ in $p_\phi(\theta|x)$
    • multiple computational approaches, depending on the problem
  3. Validation and interpretation (day 3)
    • check internal consistency of the posterior (SBC, posterior-predictive checks)
    • check consistency with domain knowledge

See: Bayesian Workflow, Gelman et al. 2020 (https://arxiv.org/abs/2011.01808).
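Step 1 and the 'rinse and repeat' remark can be made concrete with the textbook conjugate case, a Beta prior on a coin's bias with a binomial likelihood (a generic example, not from the workshop materials):

```python
def posterior_params(a, b, k, n):
    """Beta(a, b) prior + k heads in n flips -> Beta(a + k, b + n - k) posterior."""
    return a + k, b + (n - k)

# flat Beta(1, 1) prior; observe 7 heads in 10 flips
a, b = posterior_params(1, 1, k=7, n=10)
post_mean = a / (a + b)              # posterior mean of the coin's bias

# 'rinse and repeat': the posterior is the prior for the next batch of data
a2, b2 = posterior_params(a, b, k=3, n=10)
```

Updating in two batches gives the same result as one batch of 10 heads in 20 flips, which is exactly why the posterior can serve as the new prior.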

Bayesian inference

$$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{\int p(x|\theta)\,p(\theta)\,{\rm d}\theta}$$

The likelihood (model) times the prior, divided by the evidence, yields the posterior.

  • For very few models this equation yields a posterior in closed form.
    • a closed-form posterior can be sampled from and evaluated. Ideal case!
  • Usually the evidence integral is hard to compute
    • Markov chain Monte Carlo (stochastic, asymptotically exact) works with the unnormalized joint. Implicit $p(\theta|x)$: low bias / high variance (slow). Only samples!
    • Variational inference (approximate, deterministic): assume an explicit parametric posterior; fast.
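A minimal Metropolis sampler (an illustrative sketch on a conjugate Gaussian toy model, so the answer is known) shows how MCMC sidesteps the evidence: only the unnormalized joint is ever evaluated.

```python
import math
import random

def log_joint(theta, x_o=2.0):
    """Unnormalized log p(x_o, theta): N(0, 1) prior, N(theta, 1) likelihood."""
    return -0.5 * theta ** 2 - 0.5 * (x_o - theta) ** 2

def metropolis(n_steps=50000, step=1.0, seed=0):
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        # accept with prob min(1, p(prop)/p(curr)); the evidence cancels out
        if math.log(rng.random() + 1e-300) < log_joint(proposal) - log_joint(theta):
            theta = proposal
        samples.append(theta)
    return samples[5000:]   # discard burn-in

samples = metropolis()
mcmc_mean = sum(samples) / len(samples)  # true posterior here is N(1, 1/2)
```

Note what we get: only samples, no density values, and a cost that must be paid anew for every observation $x_{\rm o}$.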

Simulators as statistical models

Stochasticity and simulators

  • Typically, stochasticity models all the processes where we lack specific mechanistic hypotheses
    • unless the process is itself stochastic at the physical level, e.g. $\beta$ decay in a nucleus.
  • What are sources of stochasticity?
    • unobserved latent variables
      • stochastic program paths
    • instrument noise (aleatoric)
    • numerical approximations (→ probabilistic numerics)

→ Outputs must then be stochastic themselves: random variables → probabilities!

Since simulators have a stochastic component, could we treat simulators as statistical models?


Simulators as generative models

$$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{\int p(x|\theta)\,p(\theta)\,{\rm d}\theta}$$
  • $p(x|\theta)$ is usually not evaluatable for simulators, as they were not built for inference.
  • generative models build a probability density over samples — often an implicit one! (e.g. GANs)
  • simulators are generative models with an implicit likelihood. This is the most general scenario for inference: inference just from samples, i.e. likelihood-free inference (LFI).
  • let's look at simulators from a probabilistic perspective

Anatomy of a simulator: notation

Probabilistic programs are programs with stochasticity that are interpreted as statistical models. In this sense, simulators are probabilistic programs: computer programs that

  • take a parameter vector $\theta$ as input — we treat $\theta$ as a random variable and assign it the prior $p(\theta)$
  • compute internal states — latent variables $z_\ell \sim p_\ell(z_\ell|\theta, z_{<\ell})$
  • produce result vectors $x$ comparable with experimental observations $x_{\rm o} \sim p(x|\theta_{\rm true})$

Since simulators often have a stochastic component and often no explicit functional form, we borrow 'sampling' notation, $x \sim p(x|\theta, z)$.

Some notes about the parameters $\theta$ → latents $z_{1:L}$ → outputs $x$:

  • $\theta$: fixed dimensionality, typically no structure
  • $x$: can be structured (e.g. images, graphs...) and high-dimensional
  • $z$: corresponds to meaningful states, but is typically unobservable
    • continuous or discrete, changing dimensionality (even during simulation)
    • updated deterministically or stochastically
    • we will NOT infer $z$ here
  • but the existence of this unobserved state is problematic...
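In code, this anatomy is just nested sampling statements. A hypothetical toy simulator (the latent dimensionality even changes between runs via the discrete draw):

```python
import random

def simulator(theta, rng):
    """A simulator as a probabilistic program: theta -> latents z -> output x."""
    z1 = rng.gauss(theta, 1.0)            # z1 ~ p1(z1 | theta)
    n = rng.randint(1, 3)                 # discrete latent: dimensionality varies
    z2 = sum(rng.gauss(z1, 0.5) for _ in range(n)) / n   # z2 ~ p2(z2 | theta, z1)
    return z2 + rng.gauss(0.0, 0.1)       # x ~ p(x | theta, z); z stays hidden

rng = random.Random(0)
xs = [simulator(1.0, rng) for _ in range(5000)]   # sampling from p(x | theta)
sample_mean = sum(xs) / len(xs)
```

Calling the program draws from $p(x|\theta)$, but nothing in it lets us *evaluate* that density for a given $x$: the latents are discarded on the way out.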

Latent-path intractability

Let's talk about $p(x|\theta)$. How can a simulator become intractable, beyond lack of normalization?

$$\begin{align*} p(x|\theta) &= \int p(x, z|\theta)\, {\rm d}z \quad \textsf{(sum over execution traces)} \\ p(x, z|\theta) &= p(x|\theta, z) \prod_{\ell=1}^{L} p_\ell(z_\ell|\theta, z_{<\ell}) \quad \textsf{(if sequential generation)} \end{align*}$$

We seek likelihood-free techniques to free us from latent-path intractability; ideally they'd also solve the normalization problem.
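To see the structure, here is a sketch (hypothetical toy model) where the latent path is a single binary branch, so the integral over execution traces collapses to a tractable two-term sum:

```python
import math

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def likelihood(x, theta, p_branch=0.3):
    """p(x|theta) = sum_z p(z|theta) p(x|theta, z) for a binary latent z."""
    branches = {0: (theta, 1.0), 1: (theta + 3.0, 0.5)}  # z selects a code path
    weights = {0: 1.0 - p_branch, 1: p_branch}
    return sum(weights[z] * norm_pdf(x, *branches[z]) for z in (0, 1))

val = likelihood(x=0.0, theta=0.0)
# with many continuous latents, this two-term sum becomes a high-dimensional
# integral over all execution traces: that is latent-path intractability
```

With thousands of continuous latents per run, the corresponding sum is an integral over all execution traces and no closed form or cheap quadrature exists.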

Partial observability in the field

Another name for this is partial observability. Here is a case from epidemiology (Kulkarni et al. 2021, discrete compartmental model): three observed and three unobserved populations (arXiv:2106.15508 [cs.DC]).

image

Ways to make likelihoods tractable

  • model reduction / coarsening: different model, limits expressivity, model-specific ❌
  • likelihood data augmentation: model-specific, iterative ❌
    • "(...) introduce additional parameters $\psi$, which represent missing data, in such a way that the likelihood $p(x, \psi|\theta)$ is tractable. Inference then proceeds by estimating both $\theta$ and $\psi$, typically via EM or MCMC." (after O'Neill 2010)
  • martingales: elegant ✔️ but few specific models, hard ❌
  • don't? likelihood-free inference: generally applicable, can target the posterior ✔️ performance ❓ cutting edge ❓

Levels of simulator access

Think: how much do we know about a simulator? E.g. the mathematical model, access to the source code, run-time access, gradients and other quantities...

| Your simulator access | Inference strategy |
| --- | --- |
| Analytic likelihood (w. gradients maybe) | Variational inference, MCMC (use gradients) |
| Analytic conditional likelihood $p(x \vert z, \theta)$ | Data augmentation, martingales, augmented LFI |
| `import sim` (source code) | Source-level AD (LLVM, Julia) + augmented LFI; probabilistic programming (inference compilation) |
| `x = GET(θ)` (just samples) | ▶️ Bare likelihood-free inference ◀️ our workshop |

See Cranmer et al. 2020 about LFI augmentation strategies for improved sampling efficiency.

See PyProb for inference compilation, Madminer for augmented LFI in HEP.


Simulation-based inference

LFI, ABC, SBI, FBI, CIA, WHO?

LFI — Likelihood-free inference: any technique not requiring likelihoods.

ABC — Approximate Bayesian computation: from population genetics (Tavaré et al. 1997); use of sampling, implicit posteriors.

SBI — Simulation-based inference: 💡 non-ABC LFI. Modern LFI, often using NNs.

Let's have a hands-on look in a simplified case.


Interlude: from ABC to SBI

Open notebook for a first practical contact with ABC/SBI.
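The simplest member of the family, rejection ABC, fits in a dozen lines (a generic Gaussian toy, not the workshop notebook): simulate from the prior, keep parameters whose output lands close to the observation.

```python
import random

def toy_simulator(theta, rng):
    """Stochastic simulator with an implicit likelihood: x = theta + noise."""
    return theta + rng.gauss(0.0, 1.0)

def rejection_abc(x_o, prior_sample, eps, n_sims, seed=0):
    """Keep parameters whose simulated output lands within eps of the observation."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample(rng)
        x = toy_simulator(theta, rng)
        if abs(x - x_o) < eps:
            accepted.append(theta)
    return accepted

post = rejection_abc(x_o=2.0, prior_sample=lambda r: r.uniform(-5.0, 5.0),
                     eps=0.3, n_sims=20000)
abc_mean = sum(post) / len(post)   # approximate posterior mean, near x_o here
```

The accepted parameters approximate the posterior for this one $x_{\rm o}$; shrinking `eps` improves the approximation but wastes ever more simulations, which is exactly the inefficiency SBI methods address.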

Conditional density estimation

Amortized SBI is density estimation

Amortization is sharing learned parameters across predictions: after training once on simulations, inference for any new observation is cheap.

Conditional density estimation essentially provides us with amortized SBI.

image

With flexible neural networks we can estimate

  • the likelihood (emulation): prior-independent, but we still need MCMC or VI to handle the evidence.
  • the posterior directly.
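A hedged toy version of the idea, with the density estimator reduced from a neural network to a linear-Gaussian family (an illustrative sketch, not the `sbi` API): fit $q(\theta|x) = \mathcal{N}(ax + b, s^2)$ on simulated pairs once, then read off the posterior for any observation.

```python
import random

rng = random.Random(0)
# simulate training pairs: theta ~ N(0, 1), x | theta ~ N(theta, 1)
pairs = []
for _ in range(50000):
    theta = rng.gauss(0.0, 1.0)
    pairs.append((theta, theta + rng.gauss(0.0, 1.0)))

# fit the conditional density q(theta | x) = N(a*x + b, s2) by least squares
n = len(pairs)
mean_t = sum(t for t, _ in pairs) / n
mean_x = sum(x for _, x in pairs) / n
cov_tx = sum((t - mean_t) * (x - mean_x) for t, x in pairs) / n
var_x = sum((x - mean_x) ** 2 for _, x in pairs) / n
a = cov_tx / var_x
b = mean_t - a * mean_x
s2 = sum((t - (a * x + b)) ** 2 for t, x in pairs) / n

# amortization: the same fitted (a, b, s2) answers ANY new observation x_o
def posterior_mean(x_o):
    return a * x_o + b
```

For this conjugate toy the true posterior is $\mathcal{N}(x/2, 1/2)$, so the fit recovers $a \approx 0.5$ and $s^2 \approx 0.5$; neural density estimators replace the linear-Gaussian family with a flexible one but keep the same train-once, condition-anywhere logic.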

Advantages derived from using networks

  • Feature learning, can incorporate inductive biases
  • Scales well with more data
  • Interpolation properties
  • Differentiable

Further materials

  • Slides for the next sessions are available on GitHub
    • https://github.com/mlcolab/sbi-workshop

Resources

  • Review. Cranmer et al. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48), 30055–30062.
  • Software. Tejero-Cantero et al., (2020). sbi: A toolkit for simulation-based inference. Journal of Open Source Software, 5(52), 2505 — 🧑🏽‍💻 https://github.com/mackelab/sbi
  • Application (neuroscience). Gonçalves et al. (2020). Training deep neural density estimators to identify mechanistic models of neural dynamics. ELife, 9, e56261.
MLColab ⊂ Cluster ML in Science ⊂ University of Tübingen