LOR for Hybrid Machine Learning-Template Fitting Photometric Redshifts

Title: LOR for Hybrid Machine Learning-Template Fitting Photometric Redshifts
Contributors: Peter Hatfield (Oxford)
Co-signers: Nathan Adams (Manchester), Ken Duncan (Edinburgh), Matt Jarvis (Oxford), Aprajita Verma (Oxford)

0. Summary Statement
This letter of recommendation describes the advantages of combining template-based and machine learning-based photometric redshifts into hybrid predictions. Each approach has advantages and disadvantages, and there are several approaches in the literature that have shown it is possible to get “the best of both worlds”. We are extragalactic researchers who have successfully used these methods for redshift calculation in current deep wide-field surveys.

1. Scientific Utility
Hybrid photometric redshifts seek to optimally combine machine learning and template fitting methods for improved photo-z, and would be applicable to any science that requires redshifts.

Machine learning (ML) based methods (e.g. GPz, ANNz2, SOMz) and template fitting based methods (e.g. LePhare, BPZ, EAzY) are the two main classes of photometric redshift estimator. ML based methods are completely empirical, building a non-physical mapping from photometry to redshift using spectroscopic redshifts that are normally assumed to be the ‘true’ redshifts. Template based methods are predominantly based on physical theory; they use our understanding of how galaxy SEDs are formed (sometimes with a combination of empirical and synthetic SEDs), and then how they are redshifted, to form a mapping from photometry to redshift.

ML methods are typically more accurate than template methods when there is a high-quality spectroscopic training sample, but can fail when extrapolating to regions of colour-magnitude space not present in the training data (or if the spectroscopic redshifts in the training data are themselves incorrect). Template based methods, conversely, are normally much better at extrapolating to new redshifts and colour-magnitudes, but are ordinarily not quite as accurate as ML methods, as the SED templates used in the fitting process are normally not quite perfect representations of true galaxy SEDs. The two approaches thus have different advantages and disadvantages, and typically ML will perform better for some galaxy sub-samples and template fitting for others.

A number of authors have shown that hybrid photometric redshifts can outperform methods based on ML or templates alone. There are two main “mechanisms” for building hybrid predictions. The first mechanism, “method selection”, divides the galaxies (or colour-magnitude space) into two components by some methodology, and uses the ML prediction for the galaxies where ML is expected to be more reliable and the template prediction for the rest. The second mechanism is “consensus building”, where for each galaxy the two predictions are combined into one prediction, which is hopefully less biased because multiple semi-independent methods have been used. Consensus building is particularly powerful when one method (normally the template fitting) gives a sharp but multi-modal pdf, and the other method (normally the machine learning) gives a broader but uni-modal pdf. The second method can then identify which of the peaks in the first method is correct, resulting in a hybrid pdf with one very sharp peak.
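To illustrate the consensus-building case described above, the following is a minimal sketch in which two pdfs on a shared redshift grid are multiplied and renormalised. This is an illustrative assumption only, not the method of any of the works cited in this letter (for example, it is much simpler than a hierarchical Bayesian combination). The function names, the `floor` parameter and all numerical values are hypothetical.

```python
import numpy as np

def gaussian(z, mu, sigma):
    """Unnormalised Gaussian evaluated on the redshift grid z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2)

def normalise(pdf, z):
    """Scale pdf so that it integrates to 1 over the uniform grid z."""
    dz = z[1] - z[0]
    return pdf / (pdf.sum() * dz)

def consensus_pdf(pdf_a, pdf_b, z, floor=1e-6):
    """Multiply two pdfs on a shared grid and renormalise.

    The small floor (a hypothetical choice) stops one method from
    completely vetoing a redshift where its pdf is numerically zero.
    """
    product = (pdf_a + floor) * (pdf_b + floor)
    return normalise(product, z)

# Toy inputs: a sharp but bimodal template pdf and a broad unimodal ML pdf.
z = np.linspace(0.0, 4.0, 4001)
template_pdf = normalise(gaussian(z, 0.5, 0.03) + gaussian(z, 2.8, 0.03), z)
ml_pdf = normalise(gaussian(z, 2.6, 0.4), z)

hybrid_pdf = consensus_pdf(template_pdf, ml_pdf, z)
z_peak = z[np.argmax(hybrid_pdf)]  # redshift of the surviving peak
```

In this toy case the broad ML pdf suppresses the low-redshift template peak, so the hybrid pdf keeps only the sharp peak near z = 2.8. A plain product implicitly treats the two estimates as independent, which real ML and template photo-z are not, so a practical implementation would need to account for their covariance.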

Examples of successful hybrid photo-z in the literature: Brodwin 2006 (see figure 4) used machine learning for the brightest AGN and PAH emitters (identified with a colour-magnitude cut) and used template-fitting otherwise; Duncan 2018 (see figures 7, 8, 9) used a hierarchical Bayesian model to combine ML and template photo-z pdfs; Desprez 2020 (see figure 10) used machine learning for z<0.6 galaxies if not flagged, and reverted to template-fitting otherwise; Hatfield 2020 (see figures 12, 13) used machine learning when interpolating between the spectroscopic training sample, and reverted to template fitting-based photo-z when extrapolating.

We note that, in future, hybrid approaches might be particularly appropriate for more complex scenarios (e.g. strong lensing), where blending issues and other concerns can affect the performance of conventional photo-z estimators.

2. Outputs
The algorithm will take as input for each galaxy both a template photo-z prediction and a ML photo-z prediction, and output a hybrid photo-z prediction. These predictions could be point estimates, point estimates with an uncertainty, statistics (percentiles or mean, standard deviation, skew and kurtosis) or a full pdf. For the “method selection” approach, in addition a flag is needed to determine if ML or template will be used. This flag could be based on the magnitudes, based on the uncertainties on the individual photo-z, or based on some other output of the photo-z code.
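As a minimal sketch of the “method selection” case, the example below builds a per-galaxy flag from the individual photo-z uncertainties and uses it to choose between the two point estimates. The flag definition (use ML wherever its quoted uncertainty is smaller), the function name `select_hybrid` and the numerical values are all hypothetical; a magnitude cut or any other output of the photo-z codes could be substituted for the flag.

```python
import numpy as np

def select_hybrid(z_ml, z_template, use_ml):
    """Return the ML estimate where use_ml is True, else the template one."""
    z_ml = np.asarray(z_ml, dtype=float)
    z_template = np.asarray(z_template, dtype=float)
    use_ml = np.asarray(use_ml, dtype=bool)
    return np.where(use_ml, z_ml, z_template)

# Toy catalogue of three galaxies (hypothetical values).
z_ml = np.array([0.42, 1.10, 2.95])
z_template = np.array([0.45, 1.30, 3.20])
sigma_ml = np.array([0.02, 0.30, 0.50])
sigma_template = np.array([0.05, 0.10, 0.15])

# Hypothetical flag: trust ML only where it reports the smaller uncertainty.
use_ml = sigma_ml < sigma_template

z_hybrid = select_hybrid(z_ml, z_template, use_ml)
```

Here only the first galaxy takes its ML estimate; the other two revert to the template estimate. This relies on the two codes reporting uncertainties on a comparable scale, which would need to be checked (and calibrated if necessary) before such a flag is used in practice.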

The template and ML predictions might be consensus photo-z themselves (e.g. the consensus of all the ML methods and the consensus of all the template methods). In particular one might use a hybrid approach to combine a template-based photo-z that uses galaxy templates with a template-based photo-z that uses AGN templates (or similar).

3. Performance
Overall performance will depend on the performance of the individual ML and template methods, but the use of a hybrid method ought to give redshifts that outperform each individually.

4. Technical Aspects

Scalability – Will Meet

Precise prediction time depends on exact implementation, but should be subdominant to the computational cost of calculating the ML and template photo-z.

Inputs – Will Meet

The ML and template photo-z, possibly with flags/parameters that describe what methods are used in which parts of colour-magnitude space.

Outputs – Will Meet

Predictions could be point estimates, point estimates with uncertainties or full posteriors, depending on what the input photo-z are.
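As a sketch of how a full-pdf output could be reduced to the other output forms mentioned here (a point estimate with an uncertainty, or percentiles), the example below summarises a pdf tabulated on a uniform redshift grid. The function name `summarise_pdf`, the choice of percentiles and the toy Gaussian input are assumptions for illustration only.

```python
import numpy as np

def summarise_pdf(pdf, z, percentiles=(16, 50, 84)):
    """Mean, standard deviation and percentiles of a pdf on a uniform grid z."""
    dz = z[1] - z[0]
    pdf = pdf / (pdf.sum() * dz)           # enforce unit integral
    mean = (z * pdf).sum() * dz
    std = np.sqrt((((z - mean) ** 2) * pdf).sum() * dz)
    cdf = np.cumsum(pdf) * dz              # cumulative distribution on the grid
    pct = np.interp(np.array(percentiles) / 100.0, cdf, z)
    return {"mean": mean, "std": std, "percentiles": pct}

# Toy hybrid pdf: a single Gaussian peak (hypothetical values).
z = np.linspace(0.0, 2.0, 2001)
pdf = np.exp(-0.5 * ((z - 1.0) / 0.1) ** 2)

summary = summarise_pdf(pdf, z)
```

The percentiles are obtained by linear interpolation of the gridded cumulative distribution, so their accuracy is limited by the grid spacing.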

Storage Constraints – Will Meet

Possibly requires 1-2 extra columns for flags/parameters that describe how the hybrid predictions should be formed (perhaps a parameter for each of template photo-z reliability and ML photo-z reliability).

External Data Sets – Will Meet

The ML and template methods will require their respective external data sets.

Estimator Training and Iterative Development – Will Meet

Most implementations of hybrid methods are very simple, so will not require maintenance.

Computational Processing Constraints – Will Meet

Will operate on photo-z estimates, which will be based on measurements in the LSST Object catalogue. The cost will depend on the method used, but the computational footprint would typically be very small.

Implementation Language – Will Meet

Could be easily implemented in any language.

Maintenance – Will Meet

Should require very little maintenance/updating.
