Setting up REPO for HSC with master calibrations, HSC sky subtraction

Hello,

Is there a detailed tutorial for how to set up and populate a Butler repository for HSC data? I’ve installed v28.0.1 and successfully run the introductory tutorial using the rc2_subset of HSC data. What I’d like to do now is create a new repository so that I can follow the steps of the tutorial with my own data.

I’ve had some success by piecing together different posts and help pages. I can create a new REPO, register the instrument, write the curated calibration products for HSC, ingest raw images, and register a sky map. It’s not clear, however, how many additional steps I’m missing before I can run the pipeline.

Perhaps I’m trying to do too much from scratch? Can I repurpose elements of the REPO that are included with the HSC demo data?

I would particularly like to use the master calibration files prepared by Subaru. There is a discussion here about how to do that but it’s a little piecemeal.

Finally (with apologies for mixing topics), how is visit-level sky subtraction actually implemented for HSC? It’s discussed here but not mentioned in the getting started tutorial.

Many thanks for your help!

Just yesterday I finished my first HSC reduction run with the Gen3 middleware (using v28.0.1, like you). I’ll post my notes here soon, including how I used the SSP calibrations.

Visit-level sky subtraction is included automatically in the Gen3 pipelines; it runs as the skyCorr task in step2a of the DRP pipeline.
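
If you want to check which tasks a given step includes, here is a minimal sketch (untested here, and somewhat version-dependent), assuming drp_pipe is set up:

# Sketch: list the tasks in step2a of the HSC DRP pipeline; skyCorr
# (visit-level sky subtraction) should be among them.
import os
from lsst.pipe.base import Pipeline

uri = os.path.expandvars("${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a")
pipeline = Pipeline.from_uri(uri)
print(list(pipeline.to_graph().tasks))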

We wrote some notes for DP0.2 here: https://rtn-029.lsst.io/

Reducing some HSC data with the LSST v28.0.1 stack

Set up the stack:

pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ . /scratch/gpfs/LSST/stacks/stack_v28/loadLSST.bash
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup lsst_distrib


Set up the PostgreSQL database that I'll use for the registry:

In ~/.lsst/db-auth.yaml:

- url: postgresql://dbserver:5432/dbname
  username: myUserName
  password: myPassword

export REPO=/scratch/gpfs/RUBIN/user/price/REPO
mkdir -p $REPO

(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price/REPO $ cat seed-config.yaml
datastore:
  root: <butlerRoot>
registry:
  db: postgresql+psycopg2://dbserver:5432/dbname
  namespace: dbname_20250108

In psql:

dbname=> CREATE SCHEMA dbname_20250108;

(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler create --seed-config seed-config.yaml $REPO
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler register-instrument $REPO lsst.obs.subaru.HyperSuprimeCam

Next we want to ingest the reference catalogs. The PS1 refcat is about 400 GB, and I don't want to copy all of that in order to reduce a small number of discrete and known pointings.

butler register-dataset-type $REPO gaia_dr2_20200414 SimpleCatalog htm7
butler register-dataset-type $REPO ps1_pv3_3pi_20170110 SimpleCatalog htm7

Now we need to identify the appropriate files to ingest.

cp /projects/HSC/refcats/gaia_dr2_20200414.ecsv $REPO/
cp /projects/HSC/refcats/ps1_pv3_3pi_20170110.ecsv $REPO/

import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table

from lsst.geom import SpherePoint, degrees
from lsst.meas.algorithms.htmIndexer import HtmIndexer

# Find the HTM level-7 shards within 3 degrees of the target position.
indexer = HtmIndexer(depth=7)
target = SkyCoord("1h23m45.678s", "12d34m56.78s", unit=(u.hourangle, u.deg))
shards, isBorder = indexer.getShardIds(
    SpherePoint(target.ra.deg*degrees, target.dec.deg*degrees), 3*degrees
)

# Trim the ingest tables down to just those shards.
gaia = Table.read("gaia_dr2_20200414.ecsv")
gaia[np.isin(gaia["htm7"], shards)].write("gaia_target.ecsv")
ps1 = Table.read("ps1_pv3_3pi_20170110.ecsv")
ps1[np.isin(ps1["htm7"], shards)].write("ps1_target.ecsv")

Now ingest the files:

butler ingest-files -t copy $REPO gaia_dr2_20200414 refcats/gen2 gaia_target.ecsv
butler ingest-files -t copy $REPO ps1_pv3_3pi_20170110 refcats/gen2 ps1_target.ecsv

Register the skymap:

(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ cat skymap-target.py
config.skyMap = "discrete"
config.skyMap["discrete"].raList=[20.940325]
config.skyMap["discrete"].decList=[12.58243889]
config.skyMap["discrete"].radiusList=[2.3]

(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler register-skymap $REPO -C skymap-target.py -c name='target_v1'
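
As an optional sanity check, you can ask the registry which tracts the new skymap defines. A minimal sketch, assuming the repository path above:

# Sketch: list the tract IDs defined by the newly registered discrete skymap.
from lsst.daf.butler import Butler

butler = Butler("/scratch/gpfs/RUBIN/user/price/REPO")
for tract in butler.registry.queryDimensionRecords("tract", skymap="target_v1"):
    print(tract.id)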

Further instrument setup:

butler write-curated-calibrations $REPO lsst.obs.subaru.HyperSuprimeCam --collection HSC/calib

Now ingest the raw images:

for dd in /projects/HSC/users/price/target/raw-*; do
    butler ingest-raws $REPO $dd/HSCA*.fits* --transfer copy 2>&1 | tee -a ingest-$(basename $dd).log
done

I downloaded the calibs from Sogo Mineo and the SSP: https://tigress-web.princeton.edu/~pprice/HSC-calibs/

butler import $REPO /scratch/gpfs/RUBIN/datasets/calibs/s23b_wide_calib/ -t link

butler collection-chain $REPO HSC/defaults HSC/raw/all,HSC/calib,HSC/calib/gen2/CALIB_tp,HSC/calib/s23b_sky_rev,refcats/gen2,skymaps
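
To double-check what the chain points at, here's a small sketch using the Python API (butler query-collections from the command line works too):

# Sketch: print the members of the HSC/defaults chained collection.
from lsst.daf.butler import Butler

butler = Butler("/scratch/gpfs/RUBIN/user/price/REPO")
print(list(butler.registry.getCollectionChain("HSC/defaults")))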


Last setup step:

butler define-visits $REPO HSC

And now we should be able to run some data.

pipetask run --register-dataset-types -p "${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1" -d "instrument = 'HSC' AND exposure = 12345 AND detector = 49" -b $REPO -i HSC/defaults -o test-20250121

That worked! Time to expand.

Tiger3 compute nodes have 112 cores and 1 TB of memory. Here's my initial bps.yaml for the Parsl plugin (ctrl_bps_parsl):


wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
computeSite: tiger_1n_6h
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_6h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "06:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    scheduler_options: "#SBATCH --account=rubin"


Urgh, the query syntax doesn't support "LIKE", so I need to list all the target names and filters explicitly.
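
One way to build those explicit lists is to pull the distinct target names and filters out of the exposure records. A rough sketch, assuming the repository path above:

# Sketch: collect the distinct target names and physical filters from the
# exposure dimension records, for pasting into the IN (...) clauses below.
from lsst.daf.butler import Butler

butler = Butler("/scratch/gpfs/RUBIN/user/price/REPO")
records = list(butler.registry.queryDimensionRecords("exposure", instrument="HSC"))
print(sorted({rec.target_name for rec in records}))
print(sorted({rec.physical_filter for rec in records}))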

bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o target_20240121 -d "instrument = 'HSC' AND exposure.target_name IN ('TARGET_1', 'TARGET_2') AND exposure.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9"

FileNotFoundError: Not enough datasets (0) found for non-optional connection calibrateImage.astrometry_ref_cat (ps1_pv3_3pi_20170110) with minimum=1 for quantum data ID {instrument: 'HSC', detector: 59, visit: 12345, band: 'g', day_obs: 20150123, physical_filter: 'HSC-G'}.

Looks like the 3 degree radius was insufficient. We may as well ingest the whole refcats now: the calibs are 1.2 TB, so adding the refcats isn't a huge deal, and it will help others who want them as well.

butler remove-runs $REPO refcats/gen2

Switching Gaia from DR2 to DR3, because I see the latter is available.
Linking instead of copying, because there's now a copy of the files on /scratch/gpfs/RUBIN.

butler register-dataset-type $REPO gaia_dr3_20230707 SimpleCatalog htm7

cd /scratch/gpfs/RUBIN/datasets/refcats/
butler ingest-files -t link $REPO gaia_dr3_20230707 refcats/gaia gaia_dr3_20230707/gaia_dr3_20230707.ecsv
butler ingest-files -t link $REPO ps1_pv3_3pi_20170110 refcats/ps1 ps1_pv3_3pi_20170110/ps1_pv3_3pi_20170110.ecsv

Update the "HSC/defaults" chain:

butler remove-collections $REPO HSC/defaults
butler collection-chain $REPO HSC/defaults HSC/raw/all,HSC/calib,HSC/calib/gen2/CALIB_tp,HSC/calib/s23b_sky_rev,refcats/gaia,refcats/ps1,skymaps


Trying the "bps submit" command again...


Quanta          Tasks          
------ ------------------------
 80475                      isr
 80475           calibrateImage
 80475 analyzeAmpOffsetMetadata
 80475  transformPreSourceTable


Error 23:
        Failed to start block 22: Cannot launch job parsl.tiger.block-22.1737501036.8215468: Could not read job ID from submit command standard output; recode=1, stdout=, stderr=sbatch: error: ERROR: You have to specify an account for your slurm jobs with --account option from these options: merian rubin
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Our scheduler_options didn't make it into the submission script... Oh, it did, but without the leading "#SBATCH".
Scale down to verify we've fixed the problem...

bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o test-20250121 -d "instrument = 'HSC' AND exposure = 12345 AND detector != 9"

Yep, that worked. Now we can try the full run again.

lsst.ctrl.bps.drivers INFO: Submit stage completed: Took 96299.4631 seconds; current memory usage: 4.697 Gibyte, delta: 0.402 Gibyte, peak delta: 0.017 Gibyte
lsst.ctrl.bps.drivers INFO: Submission process completed: Took 98858.9745 seconds; current memory usage: 4.697 Gibyte, delta: 4.511 Gibyte, peak delta: 4.511 Gibyte
lsst.ctrl.bps.drivers INFO: Peak memory usage for bps process 4.697 Gibyte (main), 9.537 Gibyte (largest child process)
Run Id: None
Run Name: target_20240121_20250121T233151Z


For reference, here are the DRP-Prod steps and the dimension over which each one's quanta are grouped:

step1: detector
step2a: visit
step2b: tract (after step2a)
step2c: instrument (after step2a)
step2d: visit (after step2c)
step2e: instrument (after step2d)
step3: tract
step4: detector (skip: not for wallpaper science)
step7: instrument (after step3)

I want to add the clustering configuration to bps.yaml to try to improve the efficiency (and to add the account directly, using DM-48539).

wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
computeSite: tiger_1n_6h
includeConfigs:
  - ${DRP_PIPE_DIR}/bps/clustering/DRP-recalibrated.yaml
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_6h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "06:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin


bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a' -b $REPO -i HSC/defaults -o target_20240121 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9"

FileNotFoundError: Not enough datasets (0) found for non-optional connection skyCorr.skyFrames (sky) with minimum=1 for quantum data ID {instrument: 'HSC', visit: 12345, band: 'i', day_obs: 20150123, physical_filter: 'HSC-I'}.

That fails because there aren't any sky frames for HSC-I data taken on that date. I'll need to move the certification dates around...

(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ python updateSkyCalibs.py
Updating 103 datasets from cpsky_g_140918_141001
Updating 103 datasets from cpsky_g_150325_150325
Updating 103 datasets from cpsky_g_151114_160111
Updating 103 datasets from cpsky_g_160307_160307
Updating 103 datasets from cpsky_r_150318_150318
Updating 103 datasets from cpsky_r2_211209_211209
Updating 103 datasets from cpsky_i_150121_150121
Updating 103 datasets from cpsky_i_150320_150322
Updating 103 datasets from cpsky_i2_181207_181213
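
The updateSkyCalibs.py script itself isn't reproduced here; the sketch below shows roughly the approach, assuming the sky frames live in run collections named like cpsky_<band>_<start>_<end> and get re-certified into HSC/calib/s23b_sky_rev with validity ranges that tile the time axis per filter. Treat the names and collection choices as assumptions to adapt.

# Hypothetical sketch of updateSkyCalibs.py: re-certify the HSC "sky" frames
# so that, for each band, each cpsky_* run is valid from its own start date
# until the start of the next run (unbounded at either end), so every visit
# date finds a matching sky frame. Collection names are assumptions.
from collections import defaultdict

import astropy.time
from lsst.daf.butler import Butler, Timespan

REPO = "/scratch/gpfs/RUBIN/user/price/REPO"
CALIB = "HSC/calib/s23b_sky_rev"  # assumed calibration collection for sky frames

butler = Butler(REPO, writeable=True)
registry = butler.registry

# Group the cpsky_* runs by the band and start date encoded in their names,
# e.g. "cpsky_i_150320_150322" -> band "i", start 2015-03-20.
runsByBand = defaultdict(list)
for run in registry.queryCollections("cpsky_*"):
    _, band, start, _ = run.split("_")
    begin = astropy.time.Time(f"20{start[:2]}-{start[2:4]}-{start[4:6]}", scale="tai")
    runsByBand[band].append((begin, run))

# Wipe the existing (too narrow) certifications for the sky datasets.
registry.decertify(CALIB, "sky", Timespan(None, None))

# Re-certify each run so the validity ranges tile the time axis per band.
for band, runs in runsByBand.items():
    runs.sort()
    for i, (begin, run) in enumerate(runs):
        end = runs[i + 1][0] if i + 1 < len(runs) else None
        refs = list(registry.queryDatasets("sky", collections=run))
        print(f"Updating {len(refs)} datasets from {run}")
        registry.certify(CALIB, refs, Timespan(None if i == 0 else begin, end))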


I later discovered that I was missing a step:

pipetask run --register-dataset-types --instrument lsst.obs.subaru.HyperSuprimeCam -b $REPO -i HSC/defaults -o HSC/fgcm -p '${FGCMCAL_DIR}/pipelines/_ingredients/fgcmMakeLUT.yaml' -d "instrument = 'HSC'"

That took a while, but it worked. Let's put HSC/fgcm onto HSC/defaults.

(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler collection-chain $REPO HSC/defaults --mode extend HSC/fgcm
[HSC/raw/all, HSC/calib, HSC/calib/gen2/CALIB_tp, HSC/calib/s23b_sky_rev, refcats/gaia, refcats/ps1, skymaps, HSC/fgcm]


OK, starting fresh from step1:

(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ cat bps.yaml
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
#computeSite: tiger_1n_112c_5h
includeConfigs:
  - ${DRP_PIPE_DIR}/bps/clustering/DRP-recalibrated.yaml
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_112c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_1n_56c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 56
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_1n_28c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 28
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_2n_112c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 2
    cores_per_node: 112
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin


bps submit bps.yaml --compute-site tiger_2n_112c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND visit > 10000 AND detector != 9 AND detector < 104"

Quanta          Tasks          
------ ------------------------
 72512                      isr
 72512 analyzeAmpOffsetMetadata
 72512        characterizeImage
 72512                calibrate
 72512      writePreSourceTable
 72512  transformPreSourceTable


I need to use a master-branch version of ctrl_bps_parsl to get the 'account' parameter.

parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 142300
parsl.dataflow.dflow INFO: Tasks in state States.failed: 1362
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 1362

That looks fairly successful!

I hit Ctrl-C before the whole thing had finished (even though it said it was done), and now the next stage isn't seeing the products that were created. I must have interrupted the final transfer of the outputs into the registry, so I think I need to run:

(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ butler --long-log --log-level=VERBOSE transfer-from-graph submit/target_20250228/20250228T172155Z/target_20250228_20250228T172155Z.qgraph $REPO --register-dataset-types --update-output-chain

VERBOSE 2025-03-01T09:50:08.225-05:00 lsst.daf.butler.direct_butler._direct_butler ()(_direct_butler.py:1877) - 21036 datasets removed because the artifact does not exist. Now have 1719261.
VERBOSE 2025-03-01T22:42:46.472-05:00 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:2504) - Completed scan for missing data files
Number of datasets transferred: 1719261


So now we can move on to step2a.

bps submit bps.yaml --compute-site tiger_1n_28c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104"

Quanta           Tasks          
------ -------------------------
   704 consolidatePreSourceTable
   704   consolidateVisitSummary
   704                   skyCorr

parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 2112
parsl.dataflow.dflow INFO: Tasks in state States.failed: 0
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 0


bps submit bps.yaml --compute-site tiger_1n_28c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2b' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"

Quanta          Tasks         
------ -----------------------
     5     gbdesAstrometricFit
     1 isolatedStarAssociation

Having trouble with the g-band astrometry again (iterating for a LONG time). Let's use my modified version of gbdes, and run under pipetask instead of bps (because there are so few quanta, and we have our own node now):

pipetask run --register-dataset-types -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2b' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 10

Done.

pipetask run --register-dataset-types -b $REPO -i HSC/defaults -o target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2c' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" 2>&1 | tee target_20250228-step2c.log

Quanta           Tasks           
------ --------------------------
     1 fgcmBuildFromIsolatedStars
     1               fgcmFitCycle
     1         fgcmOutputProducts

lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (All) = 4.47 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Blue25) = 4.58 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Middle50) = 4.35 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Red25) = 4.65 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (All) = 4.71 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Blue25) = 5.10 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Middle50) = 4.81 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Red25) = 4.51 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (All) = 4.34 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Blue25) = 4.48 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Middle50) = 4.10 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Red25) = 5.21 mmag


step2d is memory-hungry, so I'll reduce the number of cores per node.

bps submit bps.yaml --compute-site tiger_1n_56c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2d' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"

Quanta            Tasks            
------ ----------------------------
   702     finalizeCharacterization
   683           updateVisitSummary
 69630 writeRecalibratedSourceTable
 69630         transformSourceTable
   683       consolidateSourceTable

That seems to be hanging, taking forever. Let's run it on our head node.

pipetask run --register-dataset-types -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2d' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 56 2>&1 | tee target_20250228-step2d.log

I think that took close to 24 hours, but it completed.

pipetask run --register-dataset-types -b $REPO -i HSC/defaults -o target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2e' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 50 2>&1 | tee target_20250228-step2e.log

Quanta       Tasks      
------ -----------------
     1 makeCcdVisitTable
     1    makeVisitTable

That was fast.
Now we get to the good stuff!

bps submit bps.yaml --compute-site tiger_2n_112c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step3' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"

Quanta            Tasks            
------ ----------------------------
     1      analyzeMatchedVisitCore
 16345                     makeWarp
   342        selectDeepCoaddVisits
   342                assembleCoadd
   342                    detection
     3       healSparsePropertyMaps
   116              mergeDetections
     3         plotPropertyMapTract
   116                      deblend
   342                      measure
   116            mergeMeasurements
   342              forcedPhotCoadd
   116             writeObjectTable
   116         transformObjectTable
     1       consolidateObjectTable
     1            catalogMatchTract
     1      validateObjectTableCore
     1       analyzeObjectTableCore
     1      photometricCatalogMatch
     1            refCatObjectTract
     1 photometricRefCatObjectTract

parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 17614
parsl.dataflow.dflow INFO: Tasks in state States.failed: 148
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 429

A bunch of things failed when the Slurm allocation ended. I'll run the remainder directly on the head node with pipetask.

pipetask run --register-dataset-types -b $REPO -o target_20250228 -g /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --skip-existing-in target_20250228 --extend-run -j 50 2>&1 | tee -a target_20250228-step3.log

That dragged on for a LONG time, with some processes running for thousands of minutes, before the machine was rebooted for the second-Tuesday downtime. I'll have to restart it.

pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ . /scratch/gpfs/LSST/stacks/stack_v28/loadLSST.bash
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup lsst_distrib
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup -jr ctrl_bps_parsl
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup -jr gbdes/


Something is trying to use the display:

X connection to localhost:10.0 broken (explicit kill or server shutdown).
pybind11::handle::dec_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
If you are convinced there is no bug in your code, you can #define PYBIND11_NO_ASSERT_GIL_HELD_INCREF_DECREF to disable this check. In that case you have to ensure this #define is consistently used for all translation units linked into a given pybind11 extension, otherwise there will be ODR violations. The failing pybind11::handle::dec_ref() call was triggered on a pybind11_type object.
terminate called after throwing an instance of 'std::runtime_error'
  what():  pybind11::handle::dec_ref() PyGILState_Check() failure.
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Task <plotPropertyMapTract dataId={band: 'g', skymap: 'target_v1', tract: 0}> failed, killed by signal 6 (Aborted); processing will continue for remaining tasks.


There's one job that's been running for 8386 minutes, and it's the only thing running now, blocking another 10 jobs. I think I'm going to kill it, and then exclude that patch from the processing.

lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 18625 quanta successfully, 14 failed and 10 remain out of total 18649 quanta.

I believe the 14 failures are due to the above display problem.

(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ unset DISPLAY
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ pipetask run --register-dataset-types -b $REPO -o target_20250228 -g /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --skip-existing-in target_20250228 --extend-run -j 50 2>&1 | tee -a target_20250228-step3.log

Hopefully the log will allow me to identify the patch that needs to be excluded.

lsst.ctrl.mpexec.singleQuantumExecutor INFO: Preparing execution of quantum for label=transformObjectTable dataId={skymap: 'target_v1', tract: 0, patch: 136}.


RuntimeError: Registry inconsistency while checking for existing quantum outputs: quantum=Quantum(taskName=lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, dataId={band: 'g', skymap: 'target_v1', tract: 0, patch: 98}) existingRefs=[DatasetRef(DatasetType('deepCoadd_measMatchFull', {band, skymap, tract, patch}, Catalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=ab80ca88-84bd-4e68-8fcf-9d68b07ce9d8)] missingRefs=[DatasetRef(DatasetType('deepCoadd_meas', {band, skymap, tract, patch}, SourceCatalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=60ce133f-3f46-4b90-bbad-1c114b4fa002), DatasetRef(DatasetType('deepCoadd_measMatch', {band, skymap, tract, patch}, Catalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=b3184559-74f8-4ee7-b163-990c85459489), DatasetRef(DatasetType('measure_log', {band, skymap, tract, patch}, ButlerLogRecords), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=24edba26-eee5-4720-bcf8-b899cfe5de0d), DatasetRef(DatasetType('measure_metadata', {band, skymap, tract, patch}, TaskMetadata), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=3acb1efc-d845-41f9-84e7-e00748e9d839)]


pipetask report $REPO /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --collections target_20250228 > target_20250228-step3.report

            Task             Unknown Successful Blocked Failed Wonky TOTAL EXPECTED
---------------------------- ------- ---------- ------- ------ ----- ----- --------
                    makeWarp       0      16345       0      0     0 16345    16345
       selectDeepCoaddVisits       0        342       0      0     0   342      342
     analyzeMatchedVisitCore       0          1       0      0     0     1        1
               assembleCoadd       0        342       0      0     0   342      342
                   detection       0        342       0      0     0   342      342
      healSparsePropertyMaps       0          3       0      0     0     3        3
             mergeDetections       0        116       0      0     0   116      116
        plotPropertyMapTract       0          3       0      0     0     3        3
                     deblend       0        115       0      1     0   116      116
                     measure       1        338       3      0     0   342      342
           mergeMeasurements       1        114       1      0     0   116      116
             forcedPhotCoadd       3        336       3      0     0   342      342
            writeObjectTable       1        114       1      0     0   116      116
        transformObjectTable       1        114       1      0     0   116      116
      consolidateObjectTable       0          0       1      0     0     1        1
     photometricCatalogMatch       0          0       1      0     0     1        1
      analyzeObjectTableCore       0          0       1      0     0     1        1
     validateObjectTableCore       0          0       1      0     0     1        1
           catalogMatchTract       0          0       1      0     0     1        1
photometricRefCatObjectTract       0          0       1      0     0     1        1
           refCatObjectTract       0          0       1      0     0     1        1

Failed Quanta
[{'Data ID': {'patch': 188, 'skymap': 'target_v1', 'tract': 0},
  'Messages': [],
  'Runs and Status': {'target_20250228/20250304T223529Z': 'FAILED'},
  'Task': 'deblend'}]
Unsuccessful Datasets
[...]
 'deepCoadd_meas': [{'band': 'i', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
                    {'band': 'r', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
                    {'band': 'g', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
                    {'band': 'g', 'patch': 98, 'skymap': 'target_v1', 'tract': 0}],
[...]

Looks like patch=188 is the one that failed in the deblender, and patch=98 is the one that is taking FOREVER to run measurement. I think I could fix patch=188 by running with --clobber-outputs (or by manually deleting datasets), but that would hold everything back even longer. Let's try to push through with both of these patches excluded; we can come back to them if necessary.

pipetask run -b $REPO -o target_20250228 --extend-run --skip-existing-in target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step3' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000 AND skymap = 'target_v1' AND patch NOT IN (98, 188)" -j 50 2>&1 | tee -a target_20250228-step3-cleanup.log

lsst.ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 7 quanta for 7 tasks, graph ID: '1742240630.0240083-1627369'
Quanta            Tasks            
------ ----------------------------
     1       consolidateObjectTable
     1            catalogMatchTract
     1      photometricCatalogMatch
     1       analyzeObjectTableCore
     1      validateObjectTableCore
     1            refCatObjectTract
     1 photometricRefCatObjectTract

lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 7 quanta successfully, 0 failed and 0 remain out of total 7 quanta.

Hooray! There's an "objectTable_tract" that hopefully contains everything we care about. The butler reads it in as a pandas DataFrame.

Flux units are nJy (the warped images are scaled so that pixel values are in nJy), so the AB magnitude zero point is 31.4.
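
For a quick look at the result, here's a minimal sketch of reading the object table and turning CModel fluxes into magnitudes; the collection, skymap, and column names (standard DPDD-style schema) are assumptions to adapt:

# Sketch: read objectTable_tract as a pandas DataFrame and compute i-band
# AB magnitudes from the nJy fluxes using the 31.4 zero point.
import numpy as np
from lsst.daf.butler import Butler

butler = Butler("/scratch/gpfs/RUBIN/user/price/REPO", collections="target_20250228")
obj = butler.get("objectTable_tract", skymap="target_v1", tract=0)

good = obj["detect_isPrimary"] & (obj["i_cModelFlux"] > 0)
imag = -2.5 * np.log10(obj.loc[good, "i_cModelFlux"]) + 31.4
print(imag.describe())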

Thank you for the quick replies. It sounds like this was very timely! I’ll work through these notes and let you know if I have any problems. Thanks again!