LOR: ML-accelerated hierarchical SPS models
Co-authors (alphabetical): Justin Alsing*, George Efstathiou, Boris Leistedt*, Joel Leja, Daniel Mortlock, Hiranya Peiris
Co-signers (alphabetical): Sylvie Dagoret-Campagne, Jean-Eric Campagne, Francois Lanusse, Alex Malz
*corresponding authors: justin.alsing@fysik.su.se, b.leistedt@imperial.ac.uk
0. Summary Statement
Recent applications of stellar population synthesis (SPS) methods to photometric data, combined with advances in machine learning, open up new possibilities for deriving accurate photometric redshifts for the large amounts of data to be delivered by Rubin’s Legacy Survey of Space and Time (LSST). In this letter we set out the life cycle (preparation, deployment, and application) of such models for the LSST.
1. Context and Scientific Utility
Accurate photometric redshifts and redshift distributions are central to most cosmological analyses and therefore an active area of research. Template-based techniques are the focus of this letter. They typically employ spectral energy distributions (SEDs) from real data (e.g., nearby galaxies), data-driven models (composites), stellar population synthesis (SPS) models, or a combination of the three. However, traditionally, in order to limit the number of dimensions and make the problem tractable, only coarse grids of SEDs and simplistic models (e.g., with simple star formation histories) are used. Furthermore, the choice of SEDs, priors, and data calibration are typically based on training/validation data (e.g., with spectroscopic redshifts) and the calibrated model is then applied to the rest of the survey. This approach has shown its limits, since the training/validation data are not representative of wide-area surveys, which are also affected by spatially-varying zeropoints and noise. Furthermore, ad-hoc choices for SED models and priors can result in significant systematic redshift biases.
Some of these caveats can be alleviated by approaching the problem hierarchically, with three ingredients: A) a data model, parameterizing zeropoint and noise corrections across the sky (i.e., per tract), with priors informed, for example, by our understanding of the photometric images; B) an SPS model that generates galaxy SEDs from first principles via composite stellar populations (and integrates them through the bandpasses); C) a prior on the SPS parameters, potentially informed by our knowledge of cosmology and galaxy formation (theory and simulations), as well as by selection effects in the data (e.g., depth selection). This results in a set of latent SPS parameters (per object) plus hyperparameters describing the SPS prior, the galaxy population, the survey calibration, etc. Furthermore, this hierarchical model can be expanded with other elements, such as bandpass corrections and foregrounds (e.g., Galactic extinction), or to learn the SPS priors or selection effects in a data-driven way.
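As a concrete (toy) illustration of this three-ingredient hierarchy, a forward model for one object might be sketched as follows. All function names, functional forms, and numbers below are illustrative placeholders, not the actual SPS model or pipeline API:

```python
import numpy as np

# Toy sketch of the three-ingredient hierarchy described above.
# The SED model, priors, and numbers are placeholders, not the real pipeline.
rng = np.random.default_rng(0)

def toy_sps_fluxes(sps_params, n_bands=6):
    """(B) Stand-in for an SPS call (e.g. an FSPS evaluation or an emulator):
    model fluxes in n_bands for one galaxy, here a smooth toy spectrum
    controlled by two latent parameters."""
    amplitude, slope = sps_params
    return amplitude * np.exp(-slope * np.arange(n_bands))

def forward_model(sps_params, tract_id, zeropoints, noise_sigma):
    """(A) + (B): noisy, calibrated fluxes for one object in a given tract."""
    model = toy_sps_fluxes(sps_params)
    calibrated = model * zeropoints[tract_id]          # per-tract zeropoint correction
    return calibrated + rng.normal(0.0, noise_sigma[tract_id], size=model.shape)

def draw_sps_params():
    """(C) Population prior on the latent SPS parameters; its hyperparameters
    (fixed here) would be inferred hierarchically."""
    return np.array([rng.lognormal(0.0, 0.5), rng.uniform(0.1, 0.5)])

zeropoints = {0: 1.00, 1: 0.98}    # per-tract calibration factors (data model)
noise_sigma = {0: 0.05, 1: 0.08}   # per-tract noise levels (data model)
fluxes = forward_model(draw_sps_params(), tract_id=1,
                       zeropoints=zeropoints, noise_sigma=noise_sigma)
```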
Recently, complex SPS models (such as Prospector-alpha) leveraging this hierarchy have been successfully applied to photometric data for a variety of purposes (e.g., Leja et al. 2016, 2019). However, those studies focused on small areas where many bands (20+) were available, where spatial variations could be neglected, and often with the redshift fixed to existing photometric or spectroscopic estimates. Expanding this approach to infer photometric redshifts and to support more complicated hierarchies of parameters (including more sophisticated data models and priors on the SPS parameters) is essential for a wide-field, broad-band survey like LSST, but also challenging because: (1) calls to SPS models are expensive; (2) the SPS (or “latent”) parameters of each galaxy may not be well constrained by the data (e.g., ugrizy photometry), resulting in complex, multimodal likelihoods and increased sensitivity to the priors; (3) hyperparameters may also be ill-constrained or suffer from degeneracies.
We are taking steps to tackle these issues with modern machine learning techniques and model selection. For (1), we have built emulators of the photometric magnitudes or fluxes for any choice of model and bands, trained on prior samples (Alsing et al. 2019). For (2), we are developing an efficient hierarchical affine-invariant ensemble MCMC sampler, and experimenting with fitting the likelihood surfaces themselves (neural likelihood methods), again trained on prior samples. Issue (3) can be recast as designing the SPS model, the form of the SPS priors, and their hyperparameters so that they can be constrained by LSST data. We are exploring this with both broad-band wide-field surveys (KiDS-VIKING at present) and many-band pencil-beam surveys (COSMOS2015) to develop a tractable solution for LSST.
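For point (1), the emulation strategy can be sketched as follows: draw parameters from the prior, evaluate the (expensive) SPS photometry once per sample, and fit a small neural network to the mapping. The toy sps_photometry function and network settings below are placeholders; the real emulators (Alsing et al. 2019) are trained on actual SPS calls.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Sketch of emulating the SPS -> photometry mapping on prior samples.
# sps_photometry is a cheap stand-in for a real (expensive) SPS call.
rng = np.random.default_rng(1)

def sps_photometry(theta):
    """Placeholder for an expensive SPS call: maps three latent parameters
    to magnitudes in six bands."""
    bands = np.arange(6)
    return theta[0] + theta[1] * bands + theta[2] * np.sin(bands)

# Draw training samples from the prior and evaluate the model once per sample.
theta_train = rng.uniform(-1.0, 1.0, size=(5000, 3))
mags_train = np.array([sps_photometry(t) for t in theta_train])

# Train the emulator; at inference time it replaces the SPS call inside the
# likelihood at a small fraction of the cost.
emulator = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500)
emulator.fit(theta_train, mags_train)

# Fast approximate model calls for new parameter values within the prior range.
mags_approx = emulator.predict(rng.uniform(-1.0, 1.0, size=(10, 3)))
```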
The scientific impact of the hierarchical SPS approach reaches well beyond photometric redshifts, since it would also provide insights into data calibration (e.g., zeropoint offsets) and constraints on a range of physical parameters (e.g., star formation rates), which are of broad interest in cosmology and galaxy formation.
2. Inputs-Outputs
For any choice of hierarchical SPS model, the inputs of the inference methodology are the flux estimates, flux uncertainties, and any other metadata required (e.g., position or tract number for zeropoint offsets). The outputs are posterior PDFs on the SPS parameters (including redshift) of each object; the hyperparameters can either be fixed or marginalised over. This immediately raises the question of how the PDFs are obtained and represented, since they are typically 5-10 dimensional and multimodal. Machine learning can resolve these issues too. Changes of variables such as normalizing flows solve the storage problem: the choice of flow can be adjusted to balance the storage budget against the fidelity of the representation. Even with a good representation, however, inference (i.e., optimizing the parameters of the flows representing the posterior PDFs) may prove challenging and computationally intensive. One can instead train an additional model (e.g., a neural likelihood) to estimate those parameters in a single pass from the fluxes and flux uncertainties of each object, thereby bypassing MCMC sampling. This is an additional piece of technology to train during step (2) above (along with the emulators of step (1)), so it is beyond the scope of what DM would run. In addition to the compressed posterior PDFs, the algorithm will produce flags indicating whether any of the machine learning models were used beyond the range of inputs they were trained on and intended to be applied to. The models involved (neural likelihoods, flow representations, and SPS priors) can be simple neural networks with a minimal memory footprint, loaded and applied in Python.
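To make the interface concrete, the sketch below shows (i) a compact per-object posterior representation whose parameters would be predicted in a single pass from the photometry (a diagonal Gaussian here, standing in for a trained normalizing flow or neural likelihood), and (ii) the out-of-range flag. All names, dimensions, and ranges are illustrative.

```python
import numpy as np

# Illustrative input/output interface for one object. The "network" below is a
# placeholder returning fixed values; in practice it would be a trained
# amortized model (neural likelihood or flow) applied in Python.
TRAIN_FLUX_MIN, TRAIN_FLUX_MAX = 1e-2, 1e3   # range covered by the training set
N_PARAMS = 7                                 # e.g. redshift + 6 SPS parameters

def predict_posterior_params(fluxes, flux_errors):
    """Single-pass prediction of a compact posterior representation:
    here a mean and log-std per SPS parameter (~2 * N_PARAMS floats/object)."""
    mean = np.zeros(N_PARAMS)
    log_std = np.zeros(N_PARAMS)
    return np.concatenate([mean, log_std])

def out_of_range_flag(fluxes):
    """Flag objects whose photometry lies outside the training range of the
    machine learning models."""
    return bool(np.any((fluxes < TRAIN_FLUX_MIN) | (fluxes > TRAIN_FLUX_MAX)))

# Inputs: fluxes, flux uncertainties (plus tract number etc. for the data model).
fluxes = np.array([0.5, 1.2, 3.0, 4.1, 2.2, 0.9])
flux_errors = 0.05 * fluxes
# Outputs: compressed posterior PDF parameters and a validity flag.
compressed_posterior = predict_posterior_params(fluxes, flux_errors)
flag = out_of_range_flag(fluxes)
```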
3. Performance
We can reasonably assume that the models described above can be compressed and emulated with a neural network to deliver an approximate posterior PDF for each object, in a representation that satisfies the storage requirements. The accuracy of all the ML models involved in the process (SPS model calls, SPS priors, and the final likelihood or posterior calls) can be estimated reliably, since they are trained on prior samples. If an ML model were not sufficiently accurate, and its architecture could not be made more complex to address this, the residual error could be modeled and propagated.
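As an illustration of this last point, the emulation error could be estimated on held-out prior samples and propagated by adding it in quadrature to the photometric uncertainties used in the likelihood. The arrays below are toy placeholders standing in for real held-out model calls and emulator outputs:

```python
import numpy as np

# Toy error-propagation sketch: the arrays stand in for exact model photometry
# and emulator outputs on held-out prior samples.
rng = np.random.default_rng(2)
exact = rng.normal(size=(1000, 6))                     # exact SPS photometry (held-out)
emulated = exact + 0.01 * rng.normal(size=(1000, 6))   # emulator outputs on the same samples

emulator_sigma = (emulated - exact).std(axis=0)        # per-band emulation error

flux_err = np.full(6, 0.05)                            # observational uncertainties
effective_err = np.sqrt(flux_err**2 + emulator_sigma**2)  # inflated errors for the likelihood
```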
4. Technical aspects
We propose the following cycle for calibrating and applying ML-accelerated hierarchical SPS models:
Step 0: choose a model and priors (the three ingredients of the hierarchy described above, their parameters and priors). This must be decided by balancing the constraining power of the data and the ability to calibrate and validate the model on simulations or precursor survey data.
Step 1: train emulators for the model calls and the photometric likelihood calls. These can be trained for arbitrary models and photometric bands, and their accuracy can be increased as needed with more complex architectures. This step does not depend on the target catalogs, only on the survey characteristics (e.g., noise distributions).
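A minimal sketch of such an emulated photometric likelihood call, assuming an emulator like the one sketched in Section 1 and a simple Gaussian noise model with per-tract zeropoints (all names are placeholders):

```python
import numpy as np

# Emulated Gaussian log-likelihood for one object's photometry; `emulator` is a
# trained model-call emulator (see Section 1) and the data-model parameters
# (zeropoints, noise rescalings) would in general be sampled or marginalised.
def log_likelihood(theta, observed_flux, flux_err, zeropoints, tract_id, emulator):
    model_flux = emulator.predict(theta[None, :])[0]   # fast emulated SPS call
    model_flux = model_flux * zeropoints[tract_id]     # apply the per-tract data model
    resid = (observed_flux - model_flux) / flux_err
    return -0.5 * np.sum(resid**2 + np.log(2.0 * np.pi * flux_err**2))
```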
Step 2: constrain the hyperparameters of the hierarchical model with data or simulations. This could be time-consuming but would not be run by DM.
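One common way to realise step 2 (a simplification, not necessarily the final scheme) is to re-use per-object posterior samples obtained under an interim prior and importance re-weight them while sampling the population hyperparameters with an affine-invariant ensemble sampler. The Gaussian population model and all inputs below are toy placeholders:

```python
import numpy as np
import emcee  # affine-invariant ensemble MCMC sampler

# Toy hierarchical re-weighting: constrain hyperparameters eta = (mu, log_sigma)
# of a Gaussian population prior from per-object samples drawn under a flat
# interim prior. The population model and data are illustrative placeholders.
def log_population_prior(theta, eta):
    mu, log_sigma = eta
    sigma = np.exp(log_sigma)
    return -0.5 * ((theta - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2.0 * np.pi)

def log_hyper_posterior(eta, per_object_samples, log_interim_prior):
    """Sum over objects of log < p(theta|eta) / p_interim(theta) > over samples
    (flat hyper-prior assumed here)."""
    logp = 0.0
    for samples, log_pi0 in zip(per_object_samples, log_interim_prior):
        log_w = log_population_prior(samples, eta) - log_pi0
        logp += np.logaddexp.reduce(log_w) - np.log(len(samples))
    return logp

# Toy inputs: 100 objects, 200 posterior samples each, flat interim prior.
rng = np.random.default_rng(3)
per_object_samples = [rng.normal(0.3, 0.2, size=200) for _ in range(100)]
log_interim_prior = [np.zeros(200) for _ in range(100)]

ndim, nwalkers = 2, 16
p0 = np.column_stack([rng.normal(0.0, 0.1, nwalkers), rng.normal(-1.0, 0.1, nwalkers)])
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_hyper_posterior,
                                args=(per_object_samples, log_interim_prior))
sampler.run_mcmc(p0, 500)
```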
Step 3: return to the individual objects in the target data and obtain posterior distributions on the latent parameters, including redshift. This can be designed to have a minimal computational footprint, since it amounts to changing the population prior on the individual SPS parameters (keeping the likelihood fixed). It can be done efficiently, for example with importance sampling if the posteriors are represented with samples; other options exist, and machine learning may again prove effective (e.g., to represent the posterior distributions and update them as the population-level priors change).
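A minimal sketch of the importance-sampling option for step 3, assuming each object's posterior is stored as samples drawn under the interim (population) prior of the previous iteration; all functions and numbers are illustrative:

```python
import numpy as np

# Re-weight (by resampling) per-object posterior samples under an updated
# population prior, keeping the likelihood fixed. Priors here are toy 1-D
# examples in a single latent parameter (e.g. redshift).
def reweight_posterior(samples, log_interim_prior, log_new_prior, rng):
    log_w = log_new_prior(samples) - log_interim_prior(samples)
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    idx = rng.choice(len(samples), size=len(samples), replace=True, p=w)
    return samples[idx]

rng = np.random.default_rng(4)
samples = rng.normal(0.4, 0.15, size=2000)                 # interim-prior posterior samples
log_flat = lambda x: np.zeros_like(x)                      # interim prior (flat, unnormalised)
log_updated = lambda x: -0.5 * ((x - 0.3) / 0.1) ** 2      # updated population prior from step 2
updated_samples = reweight_posterior(samples, log_flat, log_updated, rng)
```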
Steps 2 and 3 operate as a loop: the population priors (and any other model components and hyperparameters, such as the data model) are re-calibrated as new simulations or data become available, after which the individual posteriors can be updated. We anticipate that only step 3 would be run by DM.