Reducing some HSC data with the LSST v28.0.1 stack
Set up the stack:
pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ . /scratch/gpfs/LSST/stacks/stack_v28/loadLSST.bash
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup lsst_distrib
Set up the PostgreSQL database that I'll use for the registry:
In ~/.lsst/db-auth.yaml:
- url: postgresql://dbserver:5432/dbname
  username: myUserName
  password: myPassword
export REPO=/scratch/gpfs/RUBIN/user/price/REPO
mkdir -p $REPO
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price/REPO $ cat seed-config.yaml
datastore:
  root: <butlerRoot>
registry:
  db: postgresql+psycopg2://dbserver:5432/dbname
  namespace: dbname_20250108
In psql:
dbname=> CREATE SCHEMA dbname_20250108;
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler create --seed-config seed-config.yaml $REPO
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler register-instrument $REPO lsst.obs.subaru.HyperSuprimeCam
Next we want to ingest the reference catalogs. The PS1 refcat is about 400 GB, and I don't want to copy all of that in order to reduce a small number of discrete and known pointings.
butler register-dataset-type $REPO gaia_dr2_20200414 SimpleCatalog htm7
butler register-dataset-type $REPO ps1_pv3_3pi_20170110 SimpleCatalog htm7
Now we need to identify the appropriate files to ingest.
cp /projects/HSC/refcats/gaia_dr2_20200414.ecsv $REPO/
cp /projects/HSC/refcats/ps1_pv3_3pi_20170110.ecsv $REPO/
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table
from lsst.meas.algorithms.htmIndexer import HtmIndexer
from lsst.geom import SpherePoint, degrees

# Find the HTM level-7 shards within 3 degrees of the target position
indexer = HtmIndexer(depth=7)
target = SkyCoord("1h23m45.678s", "12d34m56.78s", unit=(u.hourangle, u.deg))
shard, isBorder = indexer.getShardIds(SpherePoint(target.ra.deg*degrees, target.dec.deg*degrees), 3*degrees)

# Trim the ingest tables (filename + htm7 index) down to just those shards
gaia = Table.read("gaia_dr2_20200414.ecsv")
gaia[np.isin(gaia["htm7"], shard)].write("gaia_target.ecsv")
ps1 = Table.read("ps1_pv3_3pi_20170110.ecsv")
ps1[np.isin(ps1["htm7"], shard)].write("ps1_target.ecsv")
Now ingest the files:
butler ingest-files -t copy $REPO gaia_dr2_20200414 refcats/gen2 gaia_target.ecsv
butler ingest-files -t copy $REPO ps1_pv3_3pi_20170110 refcats/gen2 ps1_target.ecsv
Register the skymap:
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ cat skymap-target.py
config.skyMap = "discrete"
config.skyMap["discrete"].raList=[20.940325]
config.skyMap["discrete"].decList=[12.58243889]
config.skyMap["discrete"].radiusList=[2.3]
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler register-skymap $REPO -C skymap-target.py -c name='target_v1'
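As a sanity check, the raList/decList values above are just the coordinates of the target computed earlier for the refcat shard selection:

import astropy.units as u
from astropy.coordinates import SkyCoord
target = SkyCoord("1h23m45.678s", "12d34m56.78s", unit=(u.hourangle, u.deg))
print(target.ra.deg, target.dec.deg)  # 20.940325 12.58243889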
Further instrument setup:
butler write-curated-calibrations $REPO lsst.obs.subaru.HyperSuprimeCam --collection HSC/calib
Now ingest the raw images:
for dd in /projects/HSC/users/price/target/raw-*; do
    butler ingest-raws $REPO $dd/HSCA*.fits* --transfer copy 2>&1 | tee -a ingest-$(basename $dd).log
done
I downloaded the calibs from Sogo Mineo and the SSP: https://tigress-web.princeton.edu/~pprice/HSC-calibs/
butler import $REPO /scratch/gpfs/RUBIN/datasets/calibs/s23b_wide_calib/ -t link
butler collection-chain $REPO HSC/defaults HSC/raw/all,HSC/calib,HSC/calib/gen2/CALIB_tp,HSC/calib/s23b_sky_rev,refcats/gen2,skymaps
Last setup step:
butler define-visits $REPO HSC
And now we should be able to run some data.
pipetask run --register-dataset-types -p "${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1" -d "instrument = 'HSC' AND exposure = 12345 AND detector = 49" -b $REPO -i HSC/defaults -o test-20250121
That worked! Time to expand.
Tiger3 compute nodes have 112 cores and 1 TB of memory. Here's my bps.yaml for ctrl_bps_parsl:
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
computeSite: tiger_1n_6h
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_6h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "06:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    scheduler_options: "#SBATCH --account=rubin"
Urgh, the query syntax doesn't support "LIKE", so I need to list all the target names and filters explicitly.
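A quick registry query is enough to enumerate what's in the repo; something like this (a sketch via the Python API, not reproduced from my actual session):

import os
from lsst.daf.butler import Butler
butler = Butler(os.environ["REPO"])
records = list(butler.registry.queryDimensionRecords("exposure", where="instrument = 'HSC'"))
print(sorted({rec.target_name for rec in records}))
print(sorted({rec.physical_filter for rec in records}))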
bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o target_20240121 -d "instrument = 'HSC' AND exposure.target_name IN ('TARGET_1', 'TARGET_2') AND exposure.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9"
FileNotFoundError: Not enough datasets (0) found for non-optional connection calibrateImage.astrometry_ref_cat (ps1_pv3_3pi_20170110) with minimum=1 for quantum data ID {instrument: 'HSC', detector: 59, visit: 12345, band: 'g', day_obs: 20150123, physical_filter: 'HSC-G'}.
Looks like the 3 degree radius was insufficient. We may as well ingest the full refcats now: the calibs are already 1.2 TB, so adding the refcats isn't a huge deal, and it will help others who want them as well.
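One way to confirm it was a coverage problem: ask the registry which htm7 shards the failing detector overlaps and compare with what was ingested. A sketch (visit/detector taken from the error message above):

import os
from lsst.daf.butler import Butler
butler = Butler(os.environ["REPO"])
# htm7 pixels overlapping the failing detector (the registry does the spatial join)
needed = {dataId["htm7"] for dataId in butler.registry.queryDataIds(
    ["htm7"], instrument="HSC", visit=12345, detector=59)}
# htm7 shards actually ingested for PS1
have = {ref.dataId["htm7"] for ref in butler.registry.queryDatasets(
    "ps1_pv3_3pi_20170110", collections="refcats/gen2")}
print(sorted(needed - have))  # anything listed here is a missing shard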
butler remove-runs $REPO refcats/gen2
Switching Gaia from DR2 to DR3, since the latter is now available.
Linking instead of copying, because there's now a copy of the files on /scratch/gpfs/RUBIN.
butler register-dataset-type $REPO gaia_dr3_20230707 SimpleCatalog htm7
cd /scratch/gpfs/RUBIN/datasets/refcats/
butler ingest-files -t link $REPO gaia_dr3_20230707 refcats/gaia gaia_dr3_20230707/gaia_dr3_20230707.ecsv
butler ingest-files -t link $REPO ps1_pv3_3pi_20170110 refcats/ps1 ps1_pv3_3pi_20170110/ps1_pv3_3pi_20170110.ecsv
Update the "HSC/defaults" chain:
butler remove-collections $REPO HSC/defaults
butler collection-chain $REPO HSC/defaults HSC/raw/all,HSC/calib,HSC/calib/gen2/CALIB_tp,HSC/calib/s23b_sky_rev,refcats/gaia,refcats/ps1,skymaps
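To double-check the chain contents from Python (a quick sketch):

import os
from lsst.daf.butler import Butler
butler = Butler(os.environ["REPO"])
print(list(butler.registry.getCollectionChain("HSC/defaults")))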
Trying the "bps submit" command again...
Quanta          Tasks          
------ ------------------------
 80475                      isr
 80475           calibrateImage
 80475 analyzeAmpOffsetMetadata
 80475  transformPreSourceTable
Error 23:
        Failed to start block 22: Cannot launch job parsl.tiger.block-22.1737501036.8215468: Could not read job ID from submit command standard output; recode=1, stdout=, stderr=sbatch: error: ERROR: You have to specify an account for your slurm jobs with --account option from these options: merian rubin
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Our scheduler_options didn't make it into the submission script... Oh, it did, but without the leading "#SBATCH".
Scale down to verify we've fixed the problem...
bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o test-20250121 -d "instrument = 'HSC' AND exposure = 12345 AND detector != 9"
Yep, that worked. Now we can try the full run again.
lsst.ctrl.bps.drivers INFO: Submit stage completed: Took 96299.4631 seconds; current memory usage: 4.697 Gibyte, delta: 0.402 Gibyte, peak delta: 0.017 Gibyte
lsst.ctrl.bps.drivers INFO: Submission process completed: Took 98858.9745 seconds; current memory usage: 4.697 Gibyte, delta: 4.511 Gibyte, peak delta: 4.511 Gibyte
lsst.ctrl.bps.drivers INFO: Peak memory usage for bps process 4.697 Gibyte (main), 9.537 Gibyte (largest child process)
Run Id: None
Run Name: target_20240121_20250121T233151Z
For reference, here are the DRP-Prod steps and the dimension each one operates over (with ordering constraints):
step1: detector
step2a: visit
step2b: tract (after step2a)
step2c: instrument (after step2a)
step2d: visit (after step2c)
step2e: instrument (after step2d)
step3: tract
step4: detector (skip: not for wallpaper science)
step7: instrument (after step3)
I want to add the clustering configuration to bps.yaml to try to improve the efficiency (and to add the account directly, using DM-48539).
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
computeSite: tiger_1n_6h
includeConfigs:
  - ${DRP_PIPE_DIR}/bps/clustering/DRP-recalibrated.yaml
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_6h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "06:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a' -b $REPO -i HSC/defaults -o target_20240121 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9"
FileNotFoundError: Not enough datasets (0) found for non-optional connection skyCorr.skyFrames (sky) with minimum=1 for quantum data ID {instrument: 'HSC', visit: 12345, band: 'i', day_obs: 20150123, physical_filter: 'HSC-I'}.
That fails because there aren't any sky frames for HSC-I data taken on that date. I'll need to move the certification dates around...
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ python updateSkyCalibs.py
Updating 103 datasets from cpsky_g_140918_141001
Updating 103 datasets from cpsky_g_150325_150325
Updating 103 datasets from cpsky_g_151114_160111
Updating 103 datasets from cpsky_g_160307_160307
Updating 103 datasets from cpsky_r_150318_150318
Updating 103 datasets from cpsky_r2_211209_211209
Updating 103 datasets from cpsky_i_150121_150121
Updating 103 datasets from cpsky_i_150320_150322
Updating 103 datasets from cpsky_i2_181207_181213
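updateSkyCalibs.py isn't reproduced here; the gist is to take the sky frames from each of those runs and certify them into a calibration collection with validity ranges wide enough to cover the visits. A rough sketch of the approach (the collection name and timespans below are placeholders; the real script chooses the ranges per run and per filter):

import os
from astropy.time import Time
from lsst.daf.butler import Butler, CollectionType, Timespan

butler = Butler(os.environ["REPO"], writeable=True)
registry = butler.registry

# Placeholder: certify into a fresh calibration collection
newCollection = "HSC/calib/sky_rev2"
registry.registerCollection(newCollection, type=CollectionType.CALIBRATION)

# Placeholder validity ranges; they have to be chosen so that every visit falls
# inside the range of some sky frame in the same filter, without overlapping
# another certification of the same detector+filter.
ranges = {
    "cpsky_g_140918_141001": Timespan(Time("2014-09-01", scale="tai"), Time("2015-03-01", scale="tai")),
    "cpsky_i_150121_150121": Timespan(Time("2015-01-01", scale="tai"), Time("2015-06-01", scale="tai")),
    # ... etc.
}

for run, timespan in ranges.items():
    refs = list(registry.queryDatasets("sky", collections=run))
    print(f"Updating {len(refs)} datasets from {run}")
    registry.certify(newCollection, refs, timespan)

The new collection would then need to go into the HSC/defaults chain ahead of (or instead of) HSC/calib/s23b_sky_rev.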
Later I discovered that I was missing a step:
pipetask run --register-dataset-types --instrument lsst.obs.subaru.HyperSuprimeCam -b $REPO -i HSC/defaults -o HSC/fgcm -p '${FGCMCAL_DIR}/pipelines/_ingredients/fgcmMakeLUT.yaml' -d "instrument = 'HSC'"
That took a while, but it worked. Let's put HSC/fgcm onto HSC/defaults.
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler collection-chain $REPO HSC/defaults --mode extend HSC/fgcm
[HSC/raw/all, HSC/calib, HSC/calib/gen2/CALIB_tp, HSC/calib/s23b_sky_rev, refcats/gaia, refcats/ps1, skymaps, HSC/fgcm]
OK, starting fresh from step1:
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ cat bps.yaml
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
#computeSite: tiger_1n_112c_5h
includeConfigs:
  - ${DRP_PIPE_DIR}/bps/clustering/DRP-recalibrated.yaml
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_112c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_1n_56c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 56
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_1n_28c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 28
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_2n_112c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 2
    cores_per_node: 112
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
bps submit bps.yaml --compute-site tiger_2n_112c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND visit > 10000 AND detector != 9 AND detector < 104"
Quanta          Tasks          
------ ------------------------
 72512                      isr
 72512 analyzeAmpOffsetMetadata
 72512        characterizeImage
 72512                calibrate
 72512      writePreSourceTable
 72512  transformPreSourceTable
Need to use the master version of ctrl_bps_parsl for the 'account' parameter.
parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 142300
parsl.dataflow.dflow INFO: Tasks in state States.failed: 1362
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 1362
That looks fairly successful!
I hit Ctrl-C before the whole thing had finished (even though it said it was done), and now the next stage isn't seeing the products that were created. I must have interrupted the final loading of the outputs into the registry database, so I need to run:
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ butler --long-log --log-level=VERBOSE transfer-from-graph submit/target_20250228/20250228T172155Z/target_20250228_20250228T172155Z.qgraph $REPO --register-dataset-types --update-output-chain
VERBOSE 2025-03-01T09:50:08.225-05:00 lsst.daf.butler.direct_butler._direct_butler ()(_direct_butler.py:1877) - 21036 datasets removed because the artifact does not exist. Now have 1719261.
VERBOSE 2025-03-01T22:42:46.472-05:00 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:2504) - Completed scan for missing data files
Number of datasets transferred: 1719261
So now we can move on to step2a.
bps submit bps.yaml --compute-site tiger_1n_28c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104"
Quanta           Tasks          
------ -------------------------
   704 consolidatePreSourceTable
   704   consolidateVisitSummary
   704                   skyCorr
parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 2112
parsl.dataflow.dflow INFO: Tasks in state States.failed: 0
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 0
bps submit bps.yaml --compute-site tiger_1n_28c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2b' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"
Quanta          Tasks         
------ -----------------------
     5     gbdesAstrometricFit
     1 isolatedStarAssociation
Having trouble with the g-band astrometry again (it iterates for a LONG time). Let's use my modified version of gbdes, and run under pipetask instead of bps (because there are so few quanta, and we have our own node now):
pipetask run --register-dataset-types -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2b' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 10
Done.
pipetask run --register-dataset-types -b $REPO -i HSC/defaults -o target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2c' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" 2>&1 | tee target_20250228-step2c.log
Quanta           Tasks           
------ --------------------------
     1 fgcmBuildFromIsolatedStars
     1               fgcmFitCycle
     1         fgcmOutputProducts
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (All) = 4.47 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Blue25) = 4.58 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Middle50) = 4.35 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Red25) = 4.65 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (All) = 4.71 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Blue25) = 5.10 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Middle50) = 4.81 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Red25) = 4.51 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (All) = 4.34 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Blue25) = 4.48 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Middle50) = 4.10 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Red25) = 5.21 mmag
step2d is memory-hungry, so reduce the number of cores per node.
bps submit bps.yaml --compute-site tiger_1n_56c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2d' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"
Quanta            Tasks            
------ ----------------------------
   702     finalizeCharacterization
   683           updateVisitSummary
 69630 writeRecalibratedSourceTable
 69630         transformSourceTable
   683       consolidateSourceTable
That seems to be hanging and taking forever. Let's run it on our head node instead.
pipetask run --register-dataset-types -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2d' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 56 2>&1 | tee target_20250228-step2d.log
I think that took close to 24 hours, but it completed.
pipetask run --register-dataset-types -b $REPO -i HSC/defaults -o target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2e' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 50 2>&1 | tee target_20250228-step2e.log
Quanta       Tasks      
------ -----------------
     1 makeCcdVisitTable
     1    makeVisitTable
That was fast.
Now we get to the good stuff!
bps submit bps.yaml --compute-site tiger_2n_112c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step3' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"
Quanta            Tasks            
------ ----------------------------
     1      analyzeMatchedVisitCore
 16345                     makeWarp
   342        selectDeepCoaddVisits
   342                assembleCoadd
   342                    detection
     3       healSparsePropertyMaps
   116              mergeDetections
     3         plotPropertyMapTract
   116                      deblend
   342                      measure
   116            mergeMeasurements
   342              forcedPhotCoadd
   116             writeObjectTable
   116         transformObjectTable
     1       consolidateObjectTable
     1            catalogMatchTract
     1      validateObjectTableCore
     1       analyzeObjectTableCore
     1      photometricCatalogMatch
     1            refCatObjectTract
     1 photometricRefCatObjectTract
parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 17614
parsl.dataflow.dflow INFO: Tasks in state States.failed: 148
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 429
A bunch of things failed when the Slurm allocation ended. I'll run the remainder directly on the head node.
pipetask run --register-dataset-types -b $REPO -o target_20250228 -g /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --skip-existing-in target_20250228 --extend-run -j 50 2>&1 | tee -a target_20250228-step3.log
That dragged on for a LONG time, with some processes running for thousands of minutes, before the machine was rebooted for the regular second-Tuesday maintenance downtime. I'll have to restart it.
pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ . /scratch/gpfs/LSST/stacks/stack_v28/loadLSST.bash
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup lsst_distrib
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup -jr ctrl_bps_parsl
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup -jr gbdes/
Something is trying to use the display:
X connection to localhost:10.0 broken (explicit kill or server shutdown).
pybind11::handle::dec_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
If you are convinced there is no bug in your code, you can #define PYBIND11_NO_ASSERT_GIL_HELD_INCREF_DECREF to disable this check. In that case you have to ensure this #define is consistently used for all translation units linked into a given pybind11 extension, otherwise there will be ODR violations. The failing pybind11::handle::dec_ref() call was triggered on a pybind11_type object.
terminate called after throwing an instance of 'std::runtime_error'
  what():  pybind11::handle::dec_ref() PyGILState_Check() failure.
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Task <plotPropertyMapTract dataId={band: 'g', skymap: 'target_v1', tract: 0}> failed, killed by signal 6 (Aborted); processing will continue for remaining tasks.
There's one job that's been running for 8386 minutes, and it's the only thing running now, blocking another 10 jobs. I think I'm going to kill it, and then exclude that patch from the processing.
lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 18625 quanta successfully, 14 failed and 10 remain out of total 18649 quanta.
I believe the 14 failures are due to the above display problem.
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ unset DISPLAY
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ pipetask run --register-dataset-types -b $REPO -o target_20250228 -g /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --skip-existing-in target_20250228 --extend-run -j 50 2>&1 | tee -a target_20250228-step3.log
Hopefully the log will allow me to identify the patch that needs to be excluded.
lsst.ctrl.mpexec.singleQuantumExecutor INFO: Preparing execution of quantum for label=transformObjectTable dataId={skymap: 'target_v1', tract: 0, patch: 136}.
RuntimeError: Registry inconsistency while checking for existing quantum outputs: quantum=Quantum(taskName=lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, dataId={band: 'g', skymap: 'target_v1', tract: 0, patch: 98}) existingRefs=[DatasetRef(DatasetType('deepCoadd_measMatchFull', {band, skymap, tract, patch}, Catalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=ab80ca88-84bd-4e68-8fcf-9d68b07ce9d8)] missingRefs=[DatasetRef(DatasetType('deepCoadd_meas', {band, skymap, tract, patch}, SourceCatalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=60ce133f-3f46-4b90-bbad-1c114b4fa002), DatasetRef(DatasetType('deepCoadd_measMatch', {band, skymap, tract, patch}, Catalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=b3184559-74f8-4ee7-b163-990c85459489), DatasetRef(DatasetType('measure_log', {band, skymap, tract, patch}, ButlerLogRecords), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=24edba26-eee5-4720-bcf8-b899cfe5de0d), DatasetRef(DatasetType('measure_metadata', {band, skymap, tract, patch}, TaskMetadata), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=3acb1efc-d845-41f9-84e7-e00748e9d839)]
pipetask report $REPO /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --collections target_20250228 > target_20250228-step3.report
            Task             Unknown Successful Blocked Failed Wonky TOTAL EXPECTED
---------------------------- ------- ---------- ------- ------ ----- ----- --------
                    makeWarp       0      16345       0      0     0 16345    16345
       selectDeepCoaddVisits       0        342       0      0     0   342      342
     analyzeMatchedVisitCore       0          1       0      0     0     1        1
               assembleCoadd       0        342       0      0     0   342      342
                   detection       0        342       0      0     0   342      342
      healSparsePropertyMaps       0          3       0      0     0     3        3
             mergeDetections       0        116       0      0     0   116      116
        plotPropertyMapTract       0          3       0      0     0     3        3
                     deblend       0        115       0      1     0   116      116
                     measure       1        338       3      0     0   342      342
           mergeMeasurements       1        114       1      0     0   116      116
             forcedPhotCoadd       3        336       3      0     0   342      342
            writeObjectTable       1        114       1      0     0   116      116
        transformObjectTable       1        114       1      0     0   116      116
      consolidateObjectTable       0          0       1      0     0     1        1
     photometricCatalogMatch       0          0       1      0     0     1        1
      analyzeObjectTableCore       0          0       1      0     0     1        1
     validateObjectTableCore       0          0       1      0     0     1        1
           catalogMatchTract       0          0       1      0     0     1        1
photometricRefCatObjectTract       0          0       1      0     0     1        1
           refCatObjectTract       0          0       1      0     0     1        1
Failed Quanta
[{'Data ID': {'patch': 188, 'skymap': 'target_v1', 'tract': 0},
  'Messages': [],
  'Runs and Status': {'target_20250228/20250304T223529Z': 'FAILED'},
  'Task': 'deblend'}]
Unsuccessful Datasets
[...]
 'deepCoadd_meas': [{'band': 'i', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
                    {'band': 'r', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
                    {'band': 'g', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
                    {'band': 'g', 'patch': 98, 'skymap': 'target_v1', 'tract': 0}],
[...]
Looks like patch=188 is the one that failed in the deblender, and patch=98 is the one that is taking FOREVER in measurement. I think I could fix patch=188 by running with --clobber-outputs (or by manually deleting datasets), but that would hold everything back even longer. Let's try to push through with both of these patches excluded; we can come back if necessary.
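For the record, the manual-deletion route (for the orphaned patch=98 output named in the RuntimeError above) would look roughly like this; a sketch I didn't actually run:

import os
from lsst.daf.butler import Butler

butler = Butler(os.environ["REPO"], writeable=True)
run = "target_20250228/20250304T223529Z"
# The orphaned output that makes the measure quantum look partially done
refs = list(butler.registry.queryDatasets(
    "deepCoadd_measMatchFull", collections=run,
    where="skymap = 'target_v1' AND tract = 0 AND patch = 98 AND band = 'g'"))
butler.pruneDatasets(refs, disassociate=True, unstore=True, purge=True)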
pipetask run -b $REPO -o target_20250228 --extend-run --skip-existing-in target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step3' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000 AND skymap = 'target_v1' AND patch NOT IN (98, 188)" -j 50 2>&1 | tee -a target_20250228-step3-cleanup.log
lsst.ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 7 quanta for 7 tasks, graph ID: '1742240630.0240083-1627369'
Quanta            Tasks            
------ ----------------------------
     1       consolidateObjectTable
     1            catalogMatchTract
     1      photometricCatalogMatch
     1       analyzeObjectTableCore
     1      validateObjectTableCore
     1            refCatObjectTract
     1 photometricRefCatObjectTract
lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 7 quanta successfully, 0 failed and 0 remain out of total 7 quanta.
Hooray! There's an "objectTable_tract" that hopefully contains everything we care about. The butler reads it in as a pandas DataFrame.
Fluxes are in nJy (the images are warped so that pixel values are in nJy), so the AB magnitude zero point is 31.4.
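For example, to turn PSF fluxes into magnitudes (the column name here is my guess at the standard objectTable_tract schema; check the DataFrame's columns if it differs):

import os
import numpy as np
from lsst.daf.butler import Butler

butler = Butler(os.environ["REPO"], collections="target_20250228")
obj = butler.get("objectTable_tract", skymap="target_v1", tract=0)  # pandas DataFrame
mag_i = -2.5*np.log10(obj["i_psfFlux"]) + 31.4  # nJy -> AB magnitudes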