Scientific machine learning for data-driven discovery
Scientific discovery lacks, by definition, a ground truth. We do not know in advance whether a problem is solvable or how well we can do. Benchmark datasets rarely exist. Data are missing, and acutely not at random, because of sensor failures and collection bias. A substantial body of prior knowledge must be taken into account, such as conservation laws, dynamical equations, and integrity constraints. Prediction is seldom enough: causal understanding is the ultimate goal, and uncertainty quantification and interpretability are prerequisites. Data acquisition is mediated not by analytics of web behaviour but by expensive, often unique experiments, and data modalities are often mixed, sometimes exotic.
We see substantial potential for new developments at the interface between ML and domain-specific research. Besides the abundant algorithmic challenges in scaling, robustness, interpretability, and expression of inductive biases, there are opportunities at the edges of the ML pipeline, i.e. at the steps that are most actionable for domain scientists: problem definition, data collection, feature development, quality evaluation, and formulation of new hypotheses and interventions to address causality.
A field is emerging, with its own specific challenges, methods, and standards. Some are calling this cross-disciplinary endeavour scientific machine learning.
A colaboratory seeds a community of practice
At the ml ⇌ science colab, we seek to develop a community of practice to tackle scientific problems with machine learning methods. To that end, we
- work with domain specialists across the natural sciences, social sciences, and humanities to advance this field by collecting experience with currently open problems and fresh data.
- constantly collaborate with the ML research community in Tübingen to address the tough, specific challenges of this setting.
- train postgraduate scientists to understand data problems and use the latest algorithms and inference software.
- set out to find what works where, but also what fails and why. For the benefit of the community, we make a point of communicating our findings in both traditional and interactive formats.
- develop pieces of inference software that we consider of general interest to researchers across disciplines and want to see survive the career progression of the original author(s).
We are part of a larger worldwide effort to make machine learning a useful tool for a broader audience, and a more reliable one at that. Steps in that process include developing engineering practices for machine learning [1] and largely automating some of the craft that goes into it [2].