New Star Galaxy Separator


(Sophie Reed) #1

I have developed a new star galaxy classifier for use with the stack. Currently the only trained classifier available is for HSC.

Currently Available Classifiers

Currently the only classifiers available are for HSC. These were trained on the COSMOS field using a cleaned-up version of the catalogue from Leauthaud et al. 2007 as the truth table. This cleaned-up version is available as a reference catalogue (location to be finalised) and the associated images (for validation purposes) are in /project/sr525/hstCosmosImages/. The classifiers were trained on a combination of reruns of the same data at different depths, so that the training set covers a wider range of signal-to-noise and is more representative of the wide survey. The classifiers can be found in /datasets/hsc/repo/algorithmData/starGalaxyClassifier/ and the reference catalogues used to train them in /datasets/hsc/repo/ref_cats/HST_truth_table_star_galaxy_refCat/.

Running The Classifier

The classifier runs on the deepCoadd_obj tables that are produced by running writeObjectTable.py. For example, the line below makes band-merged parquet tables for each patch in the given tract from the weekly 28 reprocessing run (this is just an example and the rerun may not exist at the time of reading).

writeObjectTable.py /datasets/hsc/repo/rerun/RC/w_2018_28/DM-14988/ --output /your/output/directory/here --id tract=9813 filter=HSC-G^HSC-R^HSC-I^HSC-Z^HSC-Y --no-versions -j 18 --clobber-config

Once the object tables have been made, the classifier can be run using the StarGalaxyClassifierTask. The version of the classifier can be changed (for example, if you have trained your own using the instructions below) by setting the config option version.

sgClass = StarGalaxyClassifierTask()
sgClass.runDataRef(dataRef)

This will produce a deepCoadd_sg table that is a duplicate of the deepCoadd_obj table but with a new index that has the features used to classify the objects and columns with the classification for each object. These columns are:

starProbColors: the probability of it being a star based off of its colours and morphology.
starProbMorph: the probability of it being a star based off of morphology information only.
starProbColorsFlag: a flag that is set if the classification failed for that object for any reason (most likely because one of the input features was NaN for that object).
starProbMorphFlag: as above but for the morphology only classifier.
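
As a rough illustration of how these columns might be used downstream, the sketch below selects likely stars from a table that has the four columns listed above. It uses a small hand-made pandas DataFrame with flat column names rather than a real deepCoadd_sg table. The 0.5 probability threshold and the helper name selectStars are assumptions made for the example, not part of the task.

```python
import numpy as np
import pandas as pd


def selectStars(table, probColumn="starProbColors", threshold=0.5):
    """Return rows classified as stars, skipping failed classifications.

    The threshold of 0.5 is an illustrative assumption; choose a value
    that suits the purity and completeness you need.
    """
    flagColumn = probColumn + "Flag"
    good = ~table[flagColumn]
    return table[good & (table[probColumn] > threshold)]


# Toy stand-in for a deepCoadd_sg table (values are made up)
toy = pd.DataFrame({
    "starProbColors": [0.95, 0.10, np.nan, 0.70],
    "starProbColorsFlag": [False, False, True, False],
    "starProbMorph": [0.90, 0.20, 0.60, 0.40],
    "starProbMorphFlag": [False, False, False, False],
})

starsColors = selectStars(toy)
starsMorph = selectStars(toy, probColumn="starProbMorph")
```

Checking the flag before applying the cut keeps objects with NaN input features out of the sample explicitly, rather than relying on how NaN behaves in a comparison.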

Training A New Classifier

To train a new classifier you first need to make a file of the features you want to include. This is done per patch using the StarGalaxyFeaturesTask. The per-patch files then need to be combined to cover a larger area, because one patch does not contain enough data for reliable training. Joining the feature files together is done by JoinFeaturesTask. An example of how to make the per-patch feature files is below:

import lsst.daf.persistence.butlerExceptions

# dataRef is a butler data reference and tract is the tract number (e.g. 9813)
skymap = dataRef.get("deepCoadd_skyMap")
tractInfo = skymap[tract]
for patchInfo in tractInfo:
    # Patch ids are strings of the form "x,y"
    patch = patchInfo.getIndex()
    patch = str(patch[0]) + "," + str(patch[1])
    mkFeaturesSG = StarGalaxyFeaturesTask()
    dataRef.dataId["patch"] = patch
    try:
        mkFeaturesSG.runDataRef(dataRef)
    except (lsst.daf.persistence.butlerExceptions.NoResults, RuntimeError, ValueError):
        # This stops the loop crashing if there is no data in the given patch
        continue

These feature tables then need to be combined so that there is enough data to train the classifier well. During this work I found that combining training data from multiple depths improved the classification. To allow for this, JoinFeaturesTask takes a dict of dataRefs, one for each rerun, keyed by a name that can be used for debugging. I also found that classification was improved by adding in more stars. This is controlled by the configuration parameter sourceTypes, which by default is set to ["galaxies", "stars", "stars", "all"]. This corresponds to taking the galaxies from the first dataRef, the stars from the second and third, and all objects from the fourth. The resulting joined features are saved by the butler into the rerun that the first dataRef comes from.

dataRefs = {"weekly28_gals": dataRef0, "weekly28_stars": dataRef0, "coadd10": dataRef1, "coadd20": dataRef2}
joinFeatures = JoinFeaturesTask()
joinFeatures.runDataRef(dataRefs)
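
To make the pairing between sourceTypes and the dataRefs concrete, here is a small stand-alone sketch of the selection logic described above. It is not the JoinFeaturesTask implementation: the function joinBySourceType, the isStar key and the toy catalogues are all hypothetical, and plain lists of dicts stand in for the feature tables.

```python
def joinBySourceType(catalogues, sourceTypes):
    """Pair each catalogue (in insertion order) with a source type and
    concatenate the selected rows. Hypothetical illustration only."""
    if len(catalogues) != len(sourceTypes):
        raise ValueError("Need exactly one source type per catalogue")
    joined = []
    for (name, rows), sourceType in zip(catalogues.items(), sourceTypes):
        if sourceType == "galaxies":
            selected = [r for r in rows if not r["isStar"]]
        elif sourceType == "stars":
            selected = [r for r in rows if r["isStar"]]
        elif sourceType == "all":
            selected = list(rows)
        else:
            raise ValueError("Unknown source type %r for %s" % (sourceType, name))
        joined.extend(selected)
    return joined


# Toy catalogues: each row is a dict with a truth label (made-up values)
deep = [{"id": 1, "isStar": True}, {"id": 2, "isStar": False}]
shallow10 = [{"id": 3, "isStar": True}, {"id": 4, "isStar": False}]
shallow20 = [{"id": 5, "isStar": False}, {"id": 6, "isStar": True}]

catalogues = {"weekly28_gals": deep, "weekly28_stars": deep,
              "coadd10": shallow10, "coadd20": shallow20}
joined = joinBySourceType(catalogues, ["galaxies", "stars", "stars", "all"])
joinedIds = [row["id"] for row in joined]
```

Note that the same catalogue appears twice under different keys, once for its galaxies and once for its stars, just as dataRef0 does in the example above.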

Once a sufficiently large set of features has been made, the classifier is trained on them using MakeStarGalaxyClassifierTask. By default this produces two decision tree classifiers (the type can be changed using the config option classifierType): one using the columns specified in the config option columnsMorph and one using the columns given by columnColor.

mkSGClass = MakeStarGalaxyClassifierTask()
mkSGClass.config.version = "myNewClassifier"
mkSGClass.runDataRef(dataRef)

This gives a new classifier accessible via the version config option.
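
For readers unfamiliar with the approach, the following sketch trains two decision trees with scikit-learn on synthetic data, one on a morphology-like feature only and one on colour plus morphology, mirroring the two-classifier setup described above. It is not the code inside MakeStarGalaxyClassifierTask. The feature names, the synthetic data and the max_depth value are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
nObj = 400

# Synthetic truth labels: 1 = star, 0 = galaxy
isStar = rng.integers(0, 2, size=nObj)

# Hypothetical features: stars are compact (small extendedness) and
# sit at a different mean colour than galaxies.
extendedness = np.where(isStar == 1, 0.0, 1.0) + rng.normal(0, 0.3, nObj)
colourGR = np.where(isStar == 1, 0.4, 0.9) + rng.normal(0, 0.3, nObj)

featuresMorph = extendedness.reshape(-1, 1)
featuresColors = np.column_stack([colourGR, extendedness])

clfMorph = DecisionTreeClassifier(max_depth=4, random_state=0)
clfMorph.fit(featuresMorph, isStar)
clfColors = DecisionTreeClassifier(max_depth=4, random_state=0)
clfColors.fit(featuresColors, isStar)

# Column 1 of predict_proba is P(star) because classes_ is [0, 1]
starProbMorph = clfMorph.predict_proba(featuresMorph)[:, 1]
starProbColors = clfColors.predict_proba(featuresColors)[:, 1]
```

The probabilities here are evaluated on the training data for brevity; any real assessment of a classifier needs a held-out set.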