Reducing some HSC data with the LSST v28.0.1 stack
Set up the stack:
pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ . /scratch/gpfs/LSST/stacks/stack_v28/loadLSST.bash
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup lsst_distrib
Set up the PostgreSQL database that I'll use for the registry:
In ~/.lsst/db-auth.yaml:
- url: postgresql://dbserver:5432/dbname
  username: myUserName
  password: myPassword
export REPO=/scratch/gpfs/RUBIN/user/price/REPO
mkdir -p $REPO
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price/REPO $ cat seed-config.yaml
datastore:
  root: <butlerRoot>
registry:
  db: postgresql+psycopg2://dbserver:5432/dbname
  namespace: dbname_20250108
In psql:
dbname=> CREATE SCHEMA dbname_20250108;
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler create --seed-config seed-config.yaml $REPO
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler register-instrument $REPO lsst.obs.subaru.HyperSuprimeCam
Next we want to ingest the reference catalogs. The PS1 refcat is about 400 GB, and I don't want to copy all of that just to reduce a small number of known, discrete pointings.
butler register-dataset-type $REPO gaia_dr2_20200414 SimpleCatalog htm7
butler register-dataset-type $REPO ps1_pv3_3pi_20170110 SimpleCatalog htm7
Now we need to identify the appropriate files to ingest.
cp /projects/HSC/refcats/gaia_dr2_20200414.ecsv $REPO/
cp /projects/HSC/refcats/ps1_pv3_3pi_20170110.ecsv $REPO/
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table

from lsst.geom import SpherePoint, degrees
from lsst.meas.algorithms.htmIndexer import HtmIndexer

# Find the HTM level-7 shards within 3 degrees of the target position.
indexer = HtmIndexer(depth=7)
target = SkyCoord("1h23m45.678s", "12d34m56.78s", unit=(u.hourangle, u.deg))
shard, isBorder = indexer.getShardIds(SpherePoint(target.ra.deg*degrees, target.dec.deg*degrees), 3*degrees)

# Cut the ingest tables down to just those shards.
gaia = Table.read("gaia_dr2_20200414.ecsv")
gaia[np.isin(gaia["htm7"], shard)].write("gaia_target.ecsv")
ps1 = Table.read("ps1_pv3_3pi_20170110.ecsv")
ps1[np.isin(ps1["htm7"], shard)].write("ps1_target.ecsv")
Now ingest the files:
butler ingest-files -t copy $REPO gaia_dr2_20200414 refcats/gen2 gaia_target.ecsv
butler ingest-files -t copy $REPO ps1_pv3_3pi_20170110 refcats/gen2 ps1_target.ecsv
Register the skymap:
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ cat skymap-target.py
config.skyMap = "discrete"
config.skyMap["discrete"].raList=[20.940325]
config.skyMap["discrete"].decList=[12.58243889]
config.skyMap["discrete"].radiusList=[2.3]
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler register-skymap $REPO -C skymap-target.py -c name='target_v1'
Further instrument setup:
butler write-curated-calibrations $REPO lsst.obs.subaru.HyperSuprimeCam --collection HSC/calib
Now ingest the raw images:
for dd in /projects/HSC/users/price/target/raw-*; do
    butler ingest-raws $REPO $dd/HSCA*.fits* --transfer copy 2>&1 | tee -a ingest-$(basename $dd).log
done
I downloaded the calibs from Sogo Mineo and the SSP: https://tigress-web.princeton.edu/~pprice/HSC-calibs/
butler import $REPO /scratch/gpfs/RUBIN/datasets/calibs/s23b_wide_calib/ -t link
butler collection-chain $REPO HSC/defaults HSC/raw/all,HSC/calib,HSC/calib/gen2/CALIB_tp,HSC/calib/s23b_sky_rev,refcats/gen2,skymaps
Last setup step:
butler define-visits $REPO HSC
And now we should be able to run some data.
pipetask run --register-dataset-types -p "${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1" -d "instrument = 'HSC' AND exposure = 12345 AND detector = 49" -b $REPO -i HSC/defaults -o test-20250121
That worked! Time to expand.
Tiger3 compute nodes have 112 cores and 1 TB of memory. Here's my bps.yaml for ctrl_bps_parsl:
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
computeSite: tiger_1n_6h
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_6h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "06:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    scheduler_options: "#SBATCH --account=rubin"
Urgh, the query syntax doesn't support "LIKE", so I need to list all the target names and filters explicitly.
bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o target_20240121 -d "instrument = 'HSC' AND exposure.target_name IN ('TARGET_1', 'TARGET_2') AND exposure.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9"
FileNotFoundError: Not enough datasets (0) found for non-optional connection calibrateImage.astrometry_ref_cat (ps1_pv3_3pi_20170110) with minimum=1 for quantum data ID {instrument: 'HSC', detector: 59, visit: 12345, band: 'g', day_obs: 20150123, physical_filter: 'HSC-G'}.
Looks like the 3 degree radius was insufficient. We may as well ingest the whole refcats now: the calibs are 1.2 TB, so adding the refcats isn't a huge deal, and it will help others who want them as well.
butler remove-runs $REPO refcats/gen2
Switching Gaia from DR2 to DR3, because I see the latter is available.
Linking instead of copying, because there's now a copy of the files on /scratch/gpfs/RUBIN.
butler register-dataset-type $REPO gaia_dr3_20230707 SimpleCatalog htm7
cd /scratch/gpfs/RUBIN/datasets/refcats/
butler ingest-files -t link $REPO gaia_dr3_20230707 refcats/gaia gaia_dr3_20230707/gaia_dr3_20230707.ecsv
butler ingest-files -t link $REPO ps1_pv3_3pi_20170110 refcats/ps1 ps1_pv3_3pi_20170110/ps1_pv3_3pi_20170110.ecsv
Update the "HSC/defaults" chain:
butler remove-collections $REPO HSC/defaults
butler collection-chain $REPO HSC/defaults HSC/raw/all,HSC/calib,HSC/calib/gen2/CALIB_tp,HSC/calib/s23b_sky_rev,refcats/gaia,refcats/ps1,skymaps
Trying the "bps submit" command again...
Quanta Tasks
------ ------------------------
80475 isr
80475 calibrateImage
80475 analyzeAmpOffsetMetadata
80475 transformPreSourceTable
Error 23:
Failed to start block 22: Cannot launch job parsl.tiger.block-22.1737501036.8215468: Could not read job ID from submit command standard output; recode=1, stdout=, stderr=sbatch: error: ERROR: You have to specify an account for your slurm jobs with --account option from these options: merian rubin
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Our scheduler_options didn't make it into the submission script... Oh, it did, but without the leading "#SBATCH".
Scale down to verify we've fixed the problem...
bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o test-20250121 -d "instrument = 'HSC' AND exposure = 12345 AND detector != 9"
Yep, that worked. Now we can try the full run again.
lsst.ctrl.bps.drivers INFO: Submit stage completed: Took 96299.4631 seconds; current memory usage: 4.697 Gibyte, delta: 0.402 Gibyte, peak delta: 0.017 Gibyte
lsst.ctrl.bps.drivers INFO: Submission process completed: Took 98858.9745 seconds; current memory usage: 4.697 Gibyte, delta: 4.511 Gibyte, peak delta: 4.511 Gibyte
lsst.ctrl.bps.drivers INFO: Peak memory usage for bps process 4.697 Gibyte (main), 9.537 Gibyte (largest child process)
Run Id: None
Run Name: target_20240121_20250121T233151Z
For reference, the DRP-Prod steps and the dimension each runs over:
step1: detector
step2a: visit
step2b: tract (after step2a)
step2c: instrument (after step2a)
step2d: visit (after step2c)
step2e: instrument (after step2d)
step3: tract
step4: detector (skip: not for wallpaper science)
step7: instrument (after step3)
I want to add the clustering configuration to bps.yaml to try to improve the efficiency (and to set the account directly, using DM-48539).
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
computeSite: tiger_1n_6h
includeConfigs:
  - ${DRP_PIPE_DIR}/bps/clustering/DRP-recalibrated.yaml
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_6h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "06:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
bps submit bps.yaml -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a' -b $REPO -i HSC/defaults -o target_20240121 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9"
FileNotFoundError: Not enough datasets (0) found for non-optional connection skyCorr.skyFrames (sky) with minimum=1 for quantum data ID {instrument: 'HSC', visit: 12345, band: 'i', day_obs: 20150123, physical_filter: 'HSC-I'}.
That fails because there aren't any sky frames for HSC-I data taken on that date. I'll need to move the certification dates around...
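updateSkyCalibs.py isn't reproduced here, but here's a minimal sketch of the idea, assuming the sky frames live in RUN collections named cpsky_* and are certified into a calibration collection like HSC/calib/s23b_sky_rev (the collection name and the example dates below are assumptions, not the script's actual contents):

import astropy.time
from lsst.daf.butler import Butler, Timespan

REPO = "/scratch/gpfs/RUBIN/user/price/REPO"
CALIB = "HSC/calib/s23b_sky_rev"  # assumption: the calibration collection holding the sky frames

# Hypothetical mapping from sky-frame run to the widened validity range I want;
# e.g. stretch the i-band sky from 2015-01-21 to cover the 2015-01-23 visits.
NEW_RANGES = {
    "cpsky_i_150121_150121": ("2015-01-01T00:00:00", "2015-03-01T00:00:00"),
    # ... one entry per cpsky_* run ...
}

butler = Butler(REPO, writeable=True)
for run, (begin, end) in NEW_RANGES.items():
    refs = list(butler.registry.queryDatasets("sky", collections=run))
    print(f"Updating {len(refs)} datasets from {run}")
    timespan = Timespan(astropy.time.Time(begin, scale="tai"),
                        astropy.time.Time(end, scale="tai"))
    # Drop any existing certification for these data IDs, then certify the new range.
    butler.registry.decertify(CALIB, "sky", Timespan(None, None),
                              dataIds=[ref.dataId for ref in refs])
    butler.registry.certify(CALIB, refs, timespan)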
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ python updateSkyCalibs.py
Updating 103 datasets from cpsky_g_140918_141001
Updating 103 datasets from cpsky_g_150325_150325
Updating 103 datasets from cpsky_g_151114_160111
Updating 103 datasets from cpsky_g_160307_160307
Updating 103 datasets from cpsky_r_150318_150318
Updating 103 datasets from cpsky_r2_211209_211209
Updating 103 datasets from cpsky_i_150121_150121
Updating 103 datasets from cpsky_i_150320_150322
Updating 103 datasets from cpsky_i2_181207_181213
Later discovered that I was missing a step:
pipetask run --register-dataset-types --instrument lsst.obs.subaru.HyperSuprimeCam -b $REPO -i HSC/defaults -o HSC/fgcm -p '${FGCMCAL_DIR}/pipelines/_ingredients/fgcmMakeLUT.yaml' -d "instrument = 'HSC'"
That took a while, but it worked. Let's put HSC/fgcm onto HSC/defaults.
(lsst-scipipe-9.0.0) pprice@tiger3:/scratch/gpfs/RUBIN/user/price $ butler collection-chain $REPO HSC/defaults --mode extend HSC/fgcm
[HSC/raw/all, HSC/calib, HSC/calib/gen2/CALIB_tp, HSC/calib/s23b_sky_rev, refcats/gaia, refcats/ps1, skymaps, HSC/fgcm]
OK, starting fresh from step1:
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ cat bps.yaml
wmsServiceClass: lsst.ctrl.bps.parsl.service.ParslService
#computeSite: tiger_1n_112c_5h
includeConfigs:
  - ${DRP_PIPE_DIR}/bps/clustering/DRP-recalibrated.yaml
site:
  local:
    class: lsst.ctrl.bps.parsl.sites.Local
    cores: 12
  tiger_1n_112c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 112
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_1n_56c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 56
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_1n_28c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 1
    cores_per_node: 28
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
  tiger_2n_112c_5h:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 2
    cores_per_node: 112
    walltime: "05:00:00"
    singleton: True
    max_blocks: 2
    mem_per_node: 980
    account: rubin
bps submit bps.yaml --compute-site tiger_2n_112c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step1' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND visit > 10000 AND detector != 9 AND detector < 104"
Quanta Tasks
------ ------------------------
72512 isr
72512 analyzeAmpOffsetMetadata
72512 characterizeImage
72512 calibrate
72512 writePreSourceTable
72512 transformPreSourceTable
Need to use a master version of ctrl_bps_parsl for the 'account' parameter.
parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 142300
parsl.dataflow.dflow INFO: Tasks in state States.failed: 1362
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 1362
That looks fairly successful!
I hit CTRL-C before the whole thing had finished (even though it said it was done), and now the next stage isn't seeing the products that were created. I must have interrupted the final transfer of outputs into the registry, so I need to run:
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ butler --long-log --log-level=VERBOSE transfer-from-graph submit/target_20250228/20250228T172155Z/target_20250228_20250228T172155Z.qgraph $REPO --register-dataset-types --update-output-chain
VERBOSE 2025-03-01T09:50:08.225-05:00 lsst.daf.butler.direct_butler._direct_butler ()(_direct_butler.py:1877) - 21036 datasets removed because the artifact does not exist. Now have 1719261.
VERBOSE 2025-03-01T22:42:46.472-05:00 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:2504) - Completed scan for missing data files
Number of datasets transferred: 1719261
So now we can move on to step2a.
bps submit bps.yaml --compute-site tiger_1n_28c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2a' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104"
Quanta Tasks
------ -------------------------
704 consolidatePreSourceTable
704 consolidateVisitSummary
704 skyCorr
parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 2112
parsl.dataflow.dflow INFO: Tasks in state States.failed: 0
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 0
bps submit bps.yaml --compute-site tiger_1n_28c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2b' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"
Quanta Tasks
------ -----------------------
5 gbdesAstrometricFit
1 isolatedStarAssociation
Having trouble with the g-band astrometry again (iterating for a LONG time). Let's use my modified version of gbdes, and run under pipetask instead of bps (because there are so few quanta, and we have our own node now):
pipetask run --register-dataset-types -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2b' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 10
Done.
pipetask run --register-dataset-types -b $REPO -i HSC/defaults -o target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2c' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" 2>&1 | tee target_20250228-step2c.log
Quanta Tasks
------ --------------------------
1 fgcmBuildFromIsolatedStars
1 fgcmFitCycle
1 fgcmOutputProducts
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (All) = 4.47 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Blue25) = 4.58 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Middle50) = 4.35 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (g) (Red25) = 4.65 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (All) = 4.71 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Blue25) = 5.10 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Middle50) = 4.81 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (r) (Red25) = 4.51 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (All) = 4.34 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Blue25) = 4.48 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Middle50) = 4.10 mmag
lsst.fgcmFitCycle INFO: reserved/crunched sigFgcm (i) (Red25) = 5.21 mmag
step2d is memory-hungry, so reduce the number of cores per node.
bps submit bps.yaml --compute-site tiger_1n_56c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2d' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"
Quanta Tasks
------ ----------------------------
702 finalizeCharacterization
683 updateVisitSummary
69630 writeRecalibratedSourceTable
69630 transformSourceTable
683 consolidateSourceTable
That seems to be hanging, taking forever. Let's run it on our head node.
pipetask run --register-dataset-types -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2d' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 56 2>&1 | tee target_20250228-step2d.log
I think that took close to 24 hours, but it completed.
pipetask run --register-dataset-types -b $REPO -i HSC/defaults -o target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step2e' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000" -j 50 2>&1 | tee target_20250228-step2e.log
Quanta Tasks
------ -----------------
1 makeCcdVisitTable
1 makeVisitTable
That was fast.
Now we get to the good stuff!
bps submit bps.yaml --compute-site tiger_2n_112c_5h -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step3' -b $REPO -i HSC/defaults -o target_20250228 -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000"
Quanta Tasks
------ ----------------------------
1 analyzeMatchedVisitCore
16345 makeWarp
342 selectDeepCoaddVisits
342 assembleCoadd
342 detection
3 healSparsePropertyMaps
116 mergeDetections
3 plotPropertyMapTract
116 deblend
342 measure
116 mergeMeasurements
342 forcedPhotCoadd
116 writeObjectTable
116 transformObjectTable
1 consolidateObjectTable
1 catalogMatchTract
1 validateObjectTableCore
1 analyzeObjectTableCore
1 photometricCatalogMatch
1 refCatObjectTract
1 photometricRefCatObjectTract
parsl.dataflow.dflow INFO: Tasks in state States.exec_done: 17614
parsl.dataflow.dflow INFO: Tasks in state States.failed: 148
parsl.dataflow.dflow INFO: Tasks in state States.dep_fail: 429
A bunch of things failed when the Slurm allocation ended. I'll run the remainder directly on the head node.
pipetask run --register-dataset-types -b $REPO -o target_20250228 -g /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --skip-existing-in target_20250228 --extend-run -j 50 2>&1 | tee -a target_20250228-step3.log
That dragged on for a LONG time, some processes running for thousands of minutes, before the machine was rebooted for the second Tuesday downtime. I'll have to restart it.
pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ . /scratch/gpfs/LSST/stacks/stack_v28/loadLSST.bash
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup lsst_distrib
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup -jr ctrl_bps_parsl
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ setup -jr gbdes/
Something is trying to use the display:
X connection to localhost:10.0 broken (explicit kill or server shutdown).
pybind11::handle::dec_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
If you are convinced there is no bug in your code, you can #define PYBIND11_NO_ASSERT_GIL_HELD_INCREF_DECREF to disable this check. In that case you have to ensure this #define is consistently used for all translation units linked into a given pybind11 extension, otherwise there will be ODR violations. The failing pybind11::handle::dec_ref() call was triggered on a pybind11_type object.
terminate called after throwing an instance of 'std::runtime_error'
what(): pybind11::handle::dec_ref() PyGILState_Check() failure.
lsst.ctrl.mpexec.mpGraphExecutor ERROR: Task <plotPropertyMapTract dataId={band: 'g', skymap: 'target_v1', tract: 0}> failed, killed by signal 6 (Aborted); processing will continue for remaining tasks.
There's one job that's been running for 8386 minutes, and it's the only thing running now, blocking another 10 jobs. I think I'm going to kill it, and then exclude that patch from the processing.
lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 18625 quanta successfully, 14 failed and 10 remain out of total 18649 quanta.
I believe the 14 failures are due to the above display problem.
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ unset DISPLAY
(lsst-scipipe-9.0.0) pprice@tiger3-sumire:/scratch/gpfs/RUBIN/user/price $ pipetask run --register-dataset-types -b $REPO -o target_20250228 -g /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --skip-existing-in target_20250228 --extend-run -j 50 2>&1 | tee -a target_20250228-step3.log
Hopefully the log will allow me to identify the patch that needs to be excluded.
lsst.ctrl.mpexec.singleQuantumExecutor INFO: Preparing execution of quantum for
label=transformObjectTable dataId={skymap: 'target_v1', tract: 0, patch: 136}.
RuntimeError: Registry inconsistency while checking for existing quantum outputs: quantum=Quantum(taskName=lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, dataId={band: 'g', skymap: 'target_v1', tract: 0, patch: 98}) existingRefs=[DatasetRef(DatasetType('deepCoadd_measMatchFull', {band, skymap, tract, patch}, Catalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=ab80ca88-84bd-4e68-8fcf-9d68b07ce9d8)] missingRefs=[DatasetRef(DatasetType('deepCoadd_meas', {band, skymap, tract, patch}, SourceCatalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=60ce133f-3f46-4b90-bbad-1c114b4fa002), DatasetRef(DatasetType('deepCoadd_measMatch', {band, skymap, tract, patch}, Catalog), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=b3184559-74f8-4ee7-b163-990c85459489), DatasetRef(DatasetType('measure_log', {band, skymap, tract, patch}, ButlerLogRecords), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=24edba26-eee5-4720-bcf8-b899cfe5de0d), DatasetRef(DatasetType('measure_metadata', {band, skymap, tract, patch}, TaskMetadata), {band: 'g', skymap: 'target_v1', tract: 0, patch: 98}, run='target_20250228/20250304T223529Z', id=3acb1efc-d845-41f9-84e7-e00748e9d839)]
pipetask report $REPO /scratch/gpfs/RUBIN/user/price/submit/target_20250228/20250304T223529Z/target_20250228_20250304T223529Z.qgraph --collections target_20250228 > target_20250228-step3.report
Task Unknown Successful Blocked Failed Wonky TOTAL EXPECTED
---------------------------- ------- ---------- ------- ------ ----- ----- --------
makeWarp 0 16345 0 0 0 16345 16345
selectDeepCoaddVisits 0 342 0 0 0 342 342
analyzeMatchedVisitCore 0 1 0 0 0 1 1
assembleCoadd 0 342 0 0 0 342 342
detection 0 342 0 0 0 342 342
healSparsePropertyMaps 0 3 0 0 0 3 3
mergeDetections 0 116 0 0 0 116 116
plotPropertyMapTract 0 3 0 0 0 3 3
deblend 0 115 0 1 0 116 116
measure 1 338 3 0 0 342 342
mergeMeasurements 1 114 1 0 0 116 116
forcedPhotCoadd 3 336 3 0 0 342 342
writeObjectTable 1 114 1 0 0 116 116
transformObjectTable 1 114 1 0 0 116 116
consolidateObjectTable 0 0 1 0 0 1 1
photometricCatalogMatch 0 0 1 0 0 1 1
analyzeObjectTableCore 0 0 1 0 0 1 1
validateObjectTableCore 0 0 1 0 0 1 1
catalogMatchTract 0 0 1 0 0 1 1
photometricRefCatObjectTract 0 0 1 0 0 1 1
refCatObjectTract 0 0 1 0 0 1 1
Failed Quanta
[{'Data ID': {'patch': 188, 'skymap': 'target_v1', 'tract': 0},
'Messages': [],
'Runs and Status': {'target_20250228/20250304T223529Z': 'FAILED'},
'Task': 'deblend'}]
Unsuccessful Datasets
[...]
'deepCoadd_meas': [{'band': 'i', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
{'band': 'r', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
{'band': 'g', 'patch': 188, 'skymap': 'target_v1', 'tract': 0},
{'band': 'g', 'patch': 98, 'skymap': 'target_v1', 'tract': 0}],
[...]
Looks like patch=188 is the one that failed with the deblender and patch=98 is the one that is taking FOREVER to run measurement. I think I could fix patch=188 by running with --clobber-outputs (or by manually deleting datasets), but that would hold everything back even longer. Let's try to push through with both of these patches excluded. Then we can come back if necessary.
pipetask run -b $REPO -o target_20250228 --extend-run --skip-existing-in target_20250228 -p '${DRP_PIPE_DIR}/pipelines/HSC/DRP-Prod.yaml#step3' -d "instrument = 'HSC' AND visit.target_name IN ('TARGET_1', 'TARGET_2') AND visit.physical_filter IN ('HSC-G', 'HSC-R', 'HSC-R2', 'HSC-I', 'HSC-I2') AND detector != 9 AND detector < 104 AND visit > 10000 AND skymap = 'target_v1' AND patch NOT IN (98, 188)" -j 50 2>&1 | tee -a target_20250228-step3-cleanup.log
lsst.ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 7 quanta for 7 tasks, graph ID: '1742240630.0240083-1627369'
Quanta Tasks
------ ----------------------------
1 consolidateObjectTable
1 catalogMatchTract
1 photometricCatalogMatch
1 analyzeObjectTableCore
1 validateObjectTableCore
1 refCatObjectTract
1 photometricRefCatObjectTract
lsst.ctrl.mpexec.mpGraphExecutor INFO: Executed 7 quanta successfully, 0 failed and 0 remain out of total 7 quanta.
Hooray! There's an "objectTable_tract" that hopefully contains everything we care about. The butler reads it in as a pandas DataFrame.
Flux units are in nJy (the images are warped so that pixel fluxes are in nJy), so the magnitude zero point is 31.4.
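For example, a minimal sketch of reading it back and converting fluxes to AB magnitudes (the column name i_cModelFlux is an assumption based on the standard object table schema):

import numpy as np
from lsst.daf.butler import Butler

# Read the object table for the single tract in the discrete skymap.
butler = Butler("/scratch/gpfs/RUBIN/user/price/REPO", collections="target_20250228")
objects = butler.get("objectTable_tract", skymap="target_v1", tract=0)  # pandas DataFrame

# Fluxes are in nJy, so the AB magnitude zero point is 31.4.
mag_i = -2.5*np.log10(objects["i_cModelFlux"]) + 31.4  # column name is an assumption
print(f"{len(objects)} objects; median i_cModel = {np.nanmedian(mag_i):.2f}")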