Adding dict-like interface to PropertySet and PropertyList

We would like to implement a dict-like interface for PropertySet and PropertyList.
I have some uncertainty over what I should implement, so I would really appreciate feedback on this issue here (before possibly going to RFC).
In particular I’d like feedback from @ktl, @RHL, @jbosch, @price, and @rowen.
If people want to look at code I have a prototype implementation in DM-15676.

At minimum this would include definitions for __iter__ (iterate through all the top level keys, also a keys() method), __len__ (how many toplevel keys are there), __delitem__ (delete a key), __eq__ (compare two PropertySets), __setitem__ (assign a new value by inferring the top if needed, and converting a dict to a PropertySet). I believe these are fairly straight forward and non-controversial.

Following from the discussion in RFC-434, it is not clear to me what __getitem__ should do (and correspondingly what get() and setdefault() should do (we do need to change get() to take a default)) since we have already decided that the return value should not dynamically change from scalar to list.

  • Should PropertySet always return a list? (it really is a vector internally). The more Pythonic approach would seem to be to change the return type from list to scalar as needed.
  • Should PropertyList always return a scalar? (we shouldn’t be duplicating FITS headers).

The problem with these approaches is that it’s not symmetric. If you assign 5 to an element in the PropertySet you will get [5] back.
It may be that returning scalar/list as needed will work in a purely dict-like interface.
Confusion begins if people mix standard PropertySet API with dict accessors (ie if they use add()).

For PropertyList assigning a dict/PropertySet is also not symmetric because you get a flattened list out and can never retrieve the hierarchy. However, if necessary we could emulate this by retrieving all the keys with that root and returning a new PropertySet (which would be detached from the original PropertyList).

For update() should it always run combine (possibly converting a dict argument to a PropertySet) or should it work more like dict.update() and assign each item one at a time (so over-writing existing values)? add() would not be part of the dict-like interface.

To match FITS headers from Astropy it seems like PropertyList.__getitem__ should always be able to use an integer index into the ordered list.

My question would be: what problem is PropertySet/PropertyList trying to solve that the standard container classes in C++/Python (say, OrderedDict) don’t already handle? Is it just trying to map FITS headers onto some C++ objects?

I do think we need a bit more than standard containers, both because we need to be able to send these to C++ (which doesn’t have a native map with heterogeneous values) and because I think we want key/value/description, not just key/value in Python.

But I also worry that there isn’t a way out of this mess that doesn’t involve changing these classes significantly on the C++. Given that they are probably much more complex that we need them to be, and that unnecessary complexity is what is causing the interface problems, replacing them may be the easiest option - though I also don’t want to hold up middleware extraction work waiting for that.

In case it’s more doable than I’m guessing, I think the behavior we want is the same for both PropertySet and PropertyList:

  • if you set an item to a list, it replaces any existing value for that key, and a list is what you get back
  • if you set an item to a scalar, it replaces any existing value for that key, and a scalar is what you get back
  • if you want to transform an existing scalar item to a list by appending another scalar item to it, or append/extend an existing list item, those are different explicit operations done by custom, non-dict methods (not __setitem__).

Speaking as a user not a designer here, I never worry about the difference between PropertySet and PropertyList, and always wish that they just behaved like a dict. I take Jim’s points about key/value/comment and the heterogeneous keys in C++, but I’d take a careful look at how we actually use our current classes before putting a lot of work in to e.g. handle values really being vectors correctly.

So, if what we really want to do is rewrite these classes to be more Python-dict-like (well, “rewrite” is an implementation for “drop the automatic vector handling and nesting”, but I think it’s probably the right implementation), here’s an idea for how that might look:

  • While working on the new metadata extraction system, @timj prototypes a new class in pure Python to identify exactly what we want the new API and semantics to be. This should include thinking about what we expect limitations to be in C++ (e.g. we should support a fixed set of value types, not arbitrary Python objects).
  • If necessary, we write converters between this new class and PropertyList, so we can interact with existing metadata extraction code. (I posit that we never actually see PropertySets for FITS headers in Python, because pybind11 will automatically downcast PropertySet pointers to PropertyList).
  • Once the pure Python API is fairly settled, I think I can write a C++ translation pretty quickly (~“a weekend”).
  • Once we have a C++ version of this class, it should be pretty easy to switch to using it in the various places where we read and parse FITS headers, as we don’t actually have too much of that; whether we do this alongside PropertyList versions that are slowly deprecated or in a big-bang transition is TBD.

In the past when we’ve discussed making PropertySet more dict-like, there have been objections on the basis that PropertySet is explicitly not a dict, so making it dict-like would be confusing.

I don’t know that a full rewrite is necessary, but it seems to me that making PropertySet behave more pythonic would go a long way towards addressing my own complaints. In particular, I want:

  • __iter__
  • __contains__
  • __len__

(I haven’t just added these myself because of laziness.)

I think it was a mistake to try to have PropertyList be a subclass of PropertySet, and it was a mistake to allow the “last-added” value of a PropertySet value to be retrieved as a scalar.

I would propose that PropertySet should be a pure hierarchical-dict with known value types including lists/vectors as a separate type distinct from scalars. (And keeping the current property that first use defines the type, which must remain the same thereafter.)

PropertyList should instead be something like astropy.io.fits.Header in C++ (and ideally exactly that class in Python) . Header should not be an acceptable value type for PropertySet.

The repeated header code in PropertyList was intended primarily for HISTORY and COMMENT cards; those are frequently repeated.

All three of those I’ve done on the ticket branch noted above.

astshim gives you that via FitsChan – that was explicitly designed to match the FITS header data model.

I agree with everything @ktl said except:

I’m not sure we want to use exactly astropy.io.fits.Header in Python, as:

  • That would require deep copies every time someone calls getMetadata() on an Exposure. We live with that sort of behavior on regular STL containers, but that’s largely because we get it for free from pybind11 and hence that saves developer time (and in the case of containers with non-Python-builtin, non-shared_ptr value types, it sidesteps some messy lifetime issues, but that’s not relevant here). If we’d have to write our converter, we might as well put that effort into pybind11 wrappers instead. But I’d be all for having our object duck like astropy.io.fits.Header, modulo my next point.

  • I don’t know if astropy.io.fits.Header's focus on being an exact representation of FITS has led to it picking up FITS weirdnesses (e.g. is there special handling for certain strings like NAXIS or COMMENT?). If so, we may not want that, and we’d instead aim for something that is similar enough to the FITS header data model to round-trip through FITS while still maybe being more appropriate for other file formats.

If we’re extracting information from header metadata into C++ objects, there should be much less need for metadata manipulation except on I/O, hopefully reducing the need for and cost of deep copies.

I didn’t see any special-casing of headers in the astropy.io.fits.Header documentation. I agree that genericization would be good, but not at the cost of gratuitous incompatibility. Maybe duck-typing is close enough.

I think we are getting off topic a little bit. Whilst a rewrite of PropertyList seems like something we want, my motivation for now is to allow obs_metadata to work on FITS-header representations that look like Python dicts. Making PropertyList work like a dict will allow me to do the header translations in a generic way supporting Astropy like objects, AST FitsChan and PropertyList. This seems like a win to me rather than using the PropertyList native API.

Can we then please focus on the question I asked above in the behavior of __getitem__() and other edge cases. I have a fully working implementation on the ticket I mention above, but I’m concerned that the behavior might not be what people want. I wonder if @rowen’s objection to weirdness with get() would be assuaged if PropertySet/List knew if it was a scalar or vector item (so you got consistent answer and add() fails for a scalar). I’m not sure that my C++ skills are good enough to add support for that in a reasonable time scale.

I think you didn’t get an answer to the original question is because I don’t think there is a good answer to the question within those bounds - the reason RFC-434 and the original APIs for those classes went they way they did is because the behavior of PropertySet and PropertyList is so different from normal dict behavior that there is no __getitem__ that would be non-surprising.

That said, I do think:

  • @price’s smaller set of Pythonic additions (and possibly more, which I imagine you’ve done) are not at all problematic;
  • I think I agree that making it so the container knows when it has a scalar or a vector would probably make it possible to define a __getitem__ that makes sense.

Another possibility might be to add a Pythonic interface to PropertyList as a separate class (i.e. a shim layer) that you could use in the metadata extraction code, in contexts where you can reject difficult cases (i.e. can’t construct a shim with vector-valued items or nesting, because you know you don’t need them and they’re what make the Pythonic interface hard).

To avoid confusion for the initial implementation, after talking to @jbosch and @rowen , I have removed the __getitem__ support from the initial proposal. We will add it back once daf_base can distinguish between scalars and arrays. In the mean time __len__, __iter__, __setitem__ and __contains__ will work once I merge DM-15676. (waiting on Jenkins).