I’m curious what Catalog.append(record) does in Python? Is the record deep copied or shallow-copied, assuming the two catalogs have the same schema?
This comes up in the context of star selectors. BaseStarSelector.run would like to set a flag in the input catalog and the returned subset catalog for every source in the returned subset catalog. Is it safe to simply set the flag for every source in that subset catalog (that would work if the subset contains shallow copies of records)?
Also, setting the flag using an explicit loop is quite slow (even using a key) and I’m wondering if there is a better way. I’d like to use the numpy array interface, but can a catalog containing shallow pointers to records be contiguous? Obviously it could be done in C++ but a full suite of optimized functions would be huge (get flag, set flag, comparison operators…).
Also, for my own record: update the documentation of BaseStarSelector.run and .selectStars to say whether the returned catalog contains shallow or deep copies of records. The lack of documentation suggests “deep” but explicit is better than implicit.
Catalog essentially always does shallow copies unless explicitly asked for deep copies. That includes append and subset operations, so the way you propose doing flag-setting in StarSelectors should work fine.
I don’t think there’s an easily solution for the iteration speed issue, though. While a shallow subset of a catalog can be contiguous, that only happens if the subset contains records that were adjacent in the original catalog, which isn’t the case for the subset chosen by a StarSelector.
I have two vague thoughts on how we could get what we want here:
The SQL parser for afw.table idea that I’ve been floating in various contexts might be able to help with this, though I think it’d need single-table UPDATE statements to solve this support, not just WHERE filtering.
We could look at the NumPy interfaces for creating array-like objects, and attempt to implement a NumPy-compatible column object that works for non-contiguous arrays.
Both of those options would of course be a lot of work. So my feeling is that beyond using Keys instead of strings (or speeding up string-based access as per DM-6823), further work to speed this up is premature optimization. I know @RHL has been on a big push lately to make sure we don’t slow things down unnecessarily, but here improving the performance is not low-hanging fruit.
How hard would it be to expand set and get to work with non-contiguous catalogs? Set seems straightforward, in theory. In Python cat["foo"] = scalar or array ought to be able to work quickly (if the iteration is performed in C++) and without any rude surprises. Get is a bit iffier, in that cat["foo"] would have to return a deep copy if the catalog was not contiguous, and not knowing if the data is deep or shallow is a pitfall. A named method would help, e.g. cat.getdeep("foo"}. More traditionally we offer get with a flag to control deep copying, but this seems redundant to me because get[] is the preferred way to obtain a shallow copy.
I think you’ve identified all the problems; it’s not that hard to implement if you’re willing to sometimes return a copy instead of a view, but being willing to do either makes it hard to come up with an idiomatic interface that isn’t full of unpleasant surprises.
Thank you. I have filed https://jira.lsstcorp.org/browse/DM-6889 that presents a no-surprises interface. Whether it is practical to implement is another question.