Copying a collection

Can someone please share some code with an example of how to create a copy of a collection? I want to change a few calexps and create a new collection with only those calexps changed. How can I accomplish that?


I think in general you would be encouraged not to do this. Instead, you can create a CHAINED collection that includes your original collection plus a new RUN collection with the replacement files “stacked on top” (make sure the ordering is correct). That way no additional space is used.
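To illustrate why the ordering matters, here is a minimal toy sketch of first-match lookup through a chain. It is not the Butler API: plain dicts stand in for RUN collections, and `find_first`, `original_run`, and `replacement_run` are hypothetical names chosen for the example.

```python
# Toy model only: plain dicts stand in for RUN collections.
# None of these names are part of the Butler API.

def find_first(chain, runs, key):
    """Search the runs in chain order; return (run_name, value) for the first hit."""
    for run_name in chain:
        if key in runs[run_name]:
            return run_name, runs[run_name][key]
    return None  # no run in the chain holds this key


# Hypothetical contents: the replacement run holds only the one changed calexp.
runs = {
    "original_run": {"calexp_1": "original", "calexp_2": "original"},
    "replacement_run": {"calexp_2": "modified"},
}

replacement_first = ["replacement_run", "original_run"]
original_first = ["original_run", "replacement_run"]

# Replacement run listed first: the modified calexp shadows the original one.
print(find_first(replacement_first, runs, "calexp_2"))
# Untouched calexps still resolve to the original run.
print(find_first(replacement_first, runs, "calexp_1"))
# Wrong order: the original calexp is found first and the replacement is never used.
print(find_first(original_first, runs, "calexp_2"))
```

With the replacement run first, only the changed calexp is overridden and everything else falls through to the original run, which is why nothing needs to be copied.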

You are right, that’s the proper way to do it, thank you.

Now I have another problem. I created a new collection containing only the calexps I changed. Then I created a chain with the new collection as a single child of the original collection. Then I tried to rerun the FGCM pipetask on the original collection (hoping that it would pick up my overridden calexps from the child collection).

However, I get some warnings and an exception. The warnings say that no datasets of types visitSummary, sourceTable_visit and calexpBackground exist in my new child collection (referenced by name in the warnings). The error is:

RuntimeError: 1 dataset(s) of type 'src_schema' was/were present in a previous query, but could not be found now. This is either a logic bug in QuantumGraph generation or the input collections have been modified since QuantumGraph generation began.

Do you have any idea what might be wrong?

Then I created a chain with the new collection as a single child of the original collection. Then I try to rerun the FGCM pipetask on the original collection…

Can you explain what you mean by this (e.g., the command line that created the chain, or the results of calling butler query-collections on the current arrangement)? The original (run?) collection should be the child of a chain, rather than having a child itself.

Hi, this is all based on the tutorial in the LSST Science Pipelines documentation.
I would like to extend the tutorial by inserting fake sources into the “single_frame” collection (probably creating a new child - or parent? - collection) and then running the rest of the tutorial on the new collection. It seems awkward to me to call “parent” the new collection that is extending an existing one. I would call that a “child” collection. But that’s just me…

I first store some calexps (with Python code) into a new collection in “u/<MY_COLLECTION>”. Then I create a chain like this:
butler collection-chain --mode=redefine $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml u//single_frame u/<MY_COLLECTION>

Then I try to run the next step in the tutorial with:
pipetask run -b $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml \
    -p $RC2_SUBSET_DIR/pipelines/DRP.yaml#fgcm \
    -i u//single_frame \
    -o u/fgcm_1638029155 \
    --register-dataset-types

But that fails with the above error.

As per your suggestion I tried removing the child collection and then doing this:
butler collection-chain --mode=redefine $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml u/<MY_COLLECTION> u//single_frame
(making u//single_frame a child of u/<MY_COLLECTION>)

But that fails, saying that u/<MY_COLLECTION> is not a CHAINED collection but a RUN collection. Indeed, that is how I created it from Python code. If I try to use “CHAINED” there, it fails…

So I’m stuck and don’t really understand what would be the recommended way of doing this.

Thanks for your help

Ah, now I understand. I missed the fact that u/$USER/single_frame was already a chained collection, containing the original run (the one you wanted “copied”) inside it.

In that case, @ktl’s original suggestion would have been most easily implemented by using prepend mode:

# For future reference; can't do this anymore
# butler collection-chain --mode=prepend $RC2_SUBSET_DIR/SMALL_HSC/ u/$USER/single_frame u/<MY_COLLECTION>

However, the definition of u/$USER/single_frame has already been overwritten to exclude the original run(s). To recover, first do:

butler query-collections $RC2_SUBSET_DIR/SMALL_HSC/ "u/$USER/single_frame*"

This should return one or more run collections of the form u/$USER/single_frame/<timestamp>. Assuming this is the case, do:

butler collection-chain --mode=redefine $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml u/$USER/single_frame u/<MY_COLLECTION>,<list,of,timestamped,runs>

If there is more than one timestamped run, list them from newest to oldest.

Then run butler query-collections again to confirm that u/$USER/single_frame is a chain containing u/<MY_COLLECTION>, followed by the timestamped runs you produced in the tutorial. At that point, the FGCM run you quoted should work.

Yes, a single_frame/ collection was inside. I added my new collection and that timestamped collection as children of “single_frame” collection, as you wrote. But when I try running FGCM, now I get this:

FileNotFoundError: Not enough datasets (0) found for non-optional connection fgcmBuildStarsTable.fgcmLookUpTable (fgcmLookUpTable) with minimum=1 for quantum data ID {instrument: 'HSC'}

I can see that u//single_frame/20211023T153536Z is a child of both u//single_frame (I added it there now) and the u//fgcm (from when I first ran the tutorial) collections… So it probably shouldn’t be a child of u//single_frame?

I removed it (that didn’t work; the old error came back), then re-added it, and now I’m getting a slightly different error:
FileNotFoundError: Not enough datasets (0) found for non-optional connection fgcmBuildStarsTable.refCat (ps1_pv3_3pi_20170110) with minimum=1 for quantum data ID {instrument: 'HSC'}.

Can you figure out what’s going on?

Partially. The second error is because I made a mistake in my “recovery” advice, and the single_frame collection should also include HSC/RC2/defaults. I’ve reviewed the tutorial again and I think things will work if you append it to single_frame (so that your final chain is [u/<MY_COLLECTION>, u//single_frame/20211023T153536Z, HSC/RC2/defaults]), but since I’ve already been wrong twice…
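As a rough illustration of why leaving HSC/RC2/defaults out of the chain produces a "not enough datasets" error for the reference catalog, here is a toy sketch. It is not the Butler API: `missing_inputs` and `single_frame_run` are hypothetical names, and the contents listed for each collection are assumptions made for the example.

```python
# Toy model only: sets of dataset type names stand in for collections.
# The contents listed for each collection are hypothetical, for illustration.

def missing_inputs(chain, contents, required):
    """Return the required dataset types that no collection in the chain provides."""
    available = set()
    for name in chain:
        available |= contents[name]
    return sorted(set(required) - available)


contents = {
    "u/<MY_COLLECTION>": {"calexp"},
    "single_frame_run": {"calexp", "src", "src_schema"},
    "HSC/RC2/defaults": {"ps1_pv3_3pi_20170110"},
}
required = ["calexp", "src_schema", "ps1_pv3_3pi_20170110"]

without_defaults = ["u/<MY_COLLECTION>", "single_frame_run"]
with_defaults = without_defaults + ["HSC/RC2/defaults"]

# The reference catalog is only reachable once the defaults are in the chain.
print(missing_inputs(without_defaults, contents, required))
print(missing_inputs(with_defaults, contents, required))
```

Running butler query-collections on the real repository remains the way to check what the chain actually contains.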

On the other hand, I don’t understand where fgcmLookUpTable comes from – I’m not familiar with FGCM itself. If it’s not included in RC2/defaults, then I’m afraid I can’t be of more help.

Incidentally, it’s possible that you will get an error about u/fgcm or u/fgcm_1638029155 having an inconsistent definition. If this happens, I recommend simply starting over with a new FGCM output collection. It is possible to delete old collections, but the current process for doing so is a bit dangerous, so I wouldn’t recommend it.

The fact that u//single_frame/20211023T153536Z is included in both u//single_frame and u//fgcm is not a problem. Any time you run pipetask run with both -i and -o, the chain created by -o will include the contents of -i. This is what allows you to, for example, use u//single_frame as an input without explicitly having to include HSC/RC2/defaults (which was used to generate it) again.
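A toy sketch of why it is harmless for the same run to sit under two chains: expanding each chain just gives an ordered list of runs to search. This is not the Butler API. `flatten`, `my_run`, and `fgcm/new_run` are hypothetical names, and the shape of the -o chain (new output run first, then the -i collection) is an assumption made for the example.

```python
# Toy model only: a dict maps each chain name to its ordered children.
# Names absent from the dict are treated as plain runs. Not the Butler API.

def flatten(name, chains):
    """Expand a collection name into the ordered list of runs it searches."""
    if name not in chains:
        return [name]
    ordered = []
    for child in chains[name]:
        for run in flatten(child, chains):
            if run not in ordered:  # keep the first occurrence only
                ordered.append(run)
    return ordered


chains = {
    "single_frame": ["my_run", "single_frame/20211023T153536Z", "HSC/RC2/defaults"],
    # Assumed shape of the -o chain: the new output run, then the -i collection.
    "fgcm": ["fgcm/new_run", "single_frame"],
}

print(flatten("single_frame", chains))
print(flatten("fgcm", chains))
```

In this model the fgcm chain reaches HSC/RC2/defaults through single_frame, so it does not have to be listed again.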

Looks like you nailed it. It’s working. Thanks!
