Ap_pipe needing a lot of memory for a long list of visits


Hi all,

I have a question regarding the use of ap_pipe on a large number of visits.
I am using HSC PDR2 data (I already have calexps and difference images, so I am only performing association, thanks to --reuse-outputs-from differencer).
I created templates for each band, and I am running ap_pipe on a large number of visits, using 16 cores at CC-IN2P3 with -j 16.
Some of my jobs ran ok, for instance one with 24 visits, but some failed.
I have a job with 47 visits that just crashed, with slurm giving me some insight:
maxvmem: 137.644G
maxrss: 28.874G
maxpss: 26.948G
which exceeds the memory I have access to.

Being an “alert” pipeline, I guess ap_pipe was designed to run on single nights, rather than on this many visits spread over months?
How different would it be to run ap_pipe:

  • visit by visit or night by night (where I would wait for each job to be finished)
  • visit by visit or night by night all in parallel
  • with all the visits provided on the command line, the way I do it now?

I guess my question is: can I split my ap_pipe runs by night, and can I run them in parallel, or would that somehow break the “association”?

Best,

Ben

PS: I think this is related to the questions in this posting: Ap_pipe in parallel jobs, but the answer seemed inconclusive to me, so I just want to check again.

Hi Ben,

You are correct in assuming that the pipeline is designed with temporally sequential visits in mind (i.e., the camera takes an image/visit, the data are processed, the camera takes the next image, etc.). So, if you want to get precisely the same results each time, your first option, visit by visit, is the one you want.

As for the large amount of memory: without a specific run-down by task of what is using that memory (or, say, what is different about those 24 visits vs. the 47 visits in terms of, for instance, source density), I can’t say for certain what is causing it. If you are attempting to output alerts via the doPackageAlerts config option, that could be the problem, as some of the image cutouts put into the alerts can be very large.
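If it helps, a crude way to get such a per-step run-down is to log the process’s peak resident set size between stages. This is a minimal sketch in plain Python using only the standard library; the stage label and the allocation are placeholders, not anything from ap_pipe itself:

```python
import resource
import sys

def log_peak_rss(label):
    """Print and return the peak resident set size seen so far by this process."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    unit = "B" if sys.platform == "darwin" else "KiB"
    print(f"{label}: peak RSS {peak} {unit}")
    return peak

# Hypothetical usage between pipeline stages:
big = [0] * 1_000_000          # stand-in for a memory-hungry step
after = log_peak_rss("after allocation")
```

Calling `log_peak_rss` after each stage of a run would at least bracket which step pushes the memory up, even without task-level instrumentation.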

Hi Chris,

Thanks a lot for the answer.
Do you know what the -j option is parallelized over?
I could maybe still use -j 16, but use a slurm job that has N calls of ap_pipe for my N visits.
I am not fully sure whether that is supposed to behave differently, or whether you are saying that the memory problem is probably not linked to the “multiple visits” issue, but rather due to a single visit that crashes because of, for instance, the number of sources detected (maybe a single defective visit with many artifacts)?

I am not trying to write out alert packages; I actually had to use w_2020_23 instead of v20.0.0, which I use for the rest of my work, because v20 writes large .avro files (as mentioned here: Run ap_pipe on existing calexps/wcs) and that cannot be turned off.
Is that what you are talking about?
Do you have an easy way to get the “run-down by task” of memory usage?
I always run with the --profile ap_pipe_profile.dat option, so maybe the information is in there?

Best,

Ben

The parallel option is over ccdVisit, so when you feed a large number of visits to a run, it will sort of randomly pick which ccdVisit to run. As you suggested, yes, you can still use -j and just feed it one visit at a time.
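As a sketch of that approach (assuming the Gen 2 ap_pipe.py command line; the repo path, rerun name, and visit IDs below are placeholders, not real values), a sequential per-visit driver that still uses -j within each visit could look like:

```shell
#!/bin/bash
# Sketch only: run ap_pipe one visit at a time, in temporal order,
# so that association sees visits sequentially.
# REPO, the rerun name, and the visit IDs are all placeholders.
set -e
REPO=/path/to/repo
VISITS="1228 1230 1232"   # hypothetical visit IDs, sorted by observation time

for v in $VISITS; do
    # -j 16 still parallelizes over ccdVisits within this single visit
    ap_pipe.py "$REPO" --rerun assoc --id visit="$v" -j 16
done
```

Since each visit finishes before the next starts, the association order matches what a night-by-night run would see, while the 16 cores are still kept busy within each visit.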

I’m guessing that a single visit has artifacted data or is somehow different from the rest, but I can’t be sure. Who knows, there might be some memory-leak problem when using the -j option. I can’t be certain without really drilling down.

Does the version you are running not have the option doPackageAlerts in diaPipe? In the latest master that option exists and is set to False by default.

It might be worth looking at that output and seeing if a specific task is causing the memory problem. I haven’t drilled down into that output much, so you’ll be a bit on your own looking for the answer.
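For reference, if the --profile output is a standard cProfile dump (an assumption on my part), it can be inspected with Python’s stdlib pstats module. The snippet below generates a stand-in stats file on the spot so it is self-contained; in practice you would point pstats at the file ap_pipe wrote:

```python
import cProfile
import pstats

# Generate a stand-in stats file (the real one would come from --profile).
def busy():
    return sum(i * i for i in range(10_000))

pr = cProfile.Profile()
pr.enable()
busy()
pr.disable()
pr.dump_stats("ap_pipe_profile.dat")

# Load the dump and print the ten entries with the largest cumulative time.
stats = pstats.Stats("ap_pipe_profile.dat")
stats.sort_stats("cumulative").print_stats(10)
```

One caveat: cProfile records call counts and CPU time, not memory, so this will show which tasks dominate the runtime but won’t directly expose a memory hog.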

Yes, w_2020_23 does have config.diaPipe.doPackageAlerts=False by default.
I looked into the profile .dat file, but it is not written when the job fails, so that was not helpful; and for the job that succeeded, I used this, but it doesn’t help much either.