We would like to implement a dict-like interface for
I have some uncertainty over what I should implement, so I would really appreciate feedback on this issue here (before possibly going to RFC).
In particular I’d like feedback from @ktl, @RHL, @jbosch, @price, and @rowen.
If people want to look at code I have a prototype implementation in DM-15676.
At minimum this would include definitions for
__iter__ (iterate through all the top level keys, also a
__len__ (how many toplevel keys are there),
__delitem__ (delete a key),
__eq__ (compare two
__setitem__ (assign a new value by inferring the top if needed, and converting a dict to a
PropertySet). I believe these are fairly straight forward and non-controversial.
Following from the discussion in RFC-434, it is not clear to me what
__getitem__ should do (and correspondingly what
setdefault() should do (we do need to change
get() to take a default)) since we have already decided that the return value should not dynamically change from scalar to list.
PropertySet always return a list? (it really is a vector internally). The more Pythonic approach would seem to be to change the return type from list to scalar as needed.
PropertyList always return a scalar? (we shouldn’t be duplicating FITS headers).
The problem with these approaches is that it’s not symmetric. If you assign 5 to an element in the
PropertySet you will get
It may be that returning scalar/list as needed will work in a purely dict-like interface.
Confusion begins if people mix standard
PropertySet API with
dict accessors (ie if they use
PropertyList assigning a
PropertySet is also not symmetric because you get a flattened list out and can never retrieve the hierarchy. However, if necessary we could emulate this by retrieving all the keys with that root and returning a new
PropertySet (which would be detached from the original
update() should it always run
combine (possibly converting a dict argument to a PropertySet) or should it work more like
dict.update() and assign each item one at a time (so over-writing existing values)?
add() would not be part of the dict-like interface.
To match FITS headers from Astropy it seems like
PropertyList.__getitem__ should always be able to use an integer index into the ordered list.
My question would be: what problem is PropertySet/PropertyList trying to solve that the standard container classes in C++/Python (say, OrderedDict) don’t already handle? Is it just trying to map FITS headers onto some C++ objects?
I do think we need a bit more than standard containers, both because we need to be able to send these to C++ (which doesn’t have a native map with heterogeneous values) and because I think we want key/value/description, not just key/value in Python.
But I also worry that there isn’t a way out of this mess that doesn’t involve changing these classes significantly on the C++. Given that they are probably much more complex that we need them to be, and that unnecessary complexity is what is causing the interface problems, replacing them may be the easiest option - though I also don’t want to hold up middleware extraction work waiting for that.
In case it’s more doable than I’m guessing, I think the behavior we want is the same for both PropertySet and PropertyList:
- if you set an item to a list, it replaces any existing value for that key, and a list is what you get back
- if you set an item to a scalar, it replaces any existing value for that key, and a scalar is what you get back
- if you want to transform an existing scalar item to a list by appending another scalar item to it, or append/extend an existing list item, those are different explicit operations done by custom, non-dict methods (not
Speaking as a user not a designer here, I never worry about the difference between
PropertyList, and always wish that they just behaved like a
dict. I take Jim’s points about key/value/comment and the heterogeneous keys in C++, but I’d take a careful look at how we actually use our current classes before putting a lot of work in to e.g. handle values really being vectors correctly.
So, if what we really want to do is rewrite these classes to be more Python-dict-like (well, “rewrite” is an implementation for “drop the automatic vector handling and nesting”, but I think it’s probably the right implementation), here’s an idea for how that might look:
- While working on the new metadata extraction system, @timj prototypes a new class in pure Python to identify exactly what we want the new API and semantics to be. This should include thinking about what we expect limitations to be in C++ (e.g. we should support a fixed set of value types, not arbitrary Python objects).
- If necessary, we write converters between this new class and
PropertyList, so we can interact with existing metadata extraction code. (I posit that we never actually see
PropertySets for FITS headers in Python, because pybind11 will automatically downcast
PropertySet pointers to
- Once the pure Python API is fairly settled, I think I can write a C++ translation pretty quickly (~“a weekend”).
- Once we have a C++ version of this class, it should be pretty easy to switch to using it in the various places where we read and parse FITS headers, as we don’t actually have too much of that; whether we do this alongside
PropertyList versions that are slowly deprecated or in a big-bang transition is TBD.
In the past when we’ve discussed making
dict-like, there have been objections on the basis that
PropertySet is explicitly not a
dict, so making it
dict-like would be confusing.
I don’t know that a full rewrite is necessary, but it seems to me that making
PropertySet behave more pythonic would go a long way towards addressing my own complaints. In particular, I want:
(I haven’t just added these myself because of laziness.)
I think it was a mistake to try to have
PropertyList be a subclass of
PropertySet, and it was a mistake to allow the “last-added” value of a
PropertySet value to be retrieved as a scalar.
I would propose that
PropertySet should be a pure hierarchical-
dict with known value types including lists/vectors as a separate type distinct from scalars. (And keeping the current property that first use defines the type, which must remain the same thereafter.)
PropertyList should instead be something like
astropy.io.fits.Header in C++ (and ideally exactly that class in Python) .
Header should not be an acceptable value type for
The repeated header code in
PropertyList was intended primarily for
COMMENT cards; those are frequently repeated.
All three of those I’ve done on the ticket branch noted above.
astshim gives you that via FitsChan – that was explicitly designed to match the FITS header data model.
I agree with everything @ktl said except:
I’m not sure we want to use exactly
astropy.io.fits.Header in Python, as:
That would require deep copies every time someone calls
getMetadata() on an
Exposure. We live with that sort of behavior on regular STL containers, but that’s largely because we get it for free from pybind11 and hence that saves developer time (and in the case of containers with non-Python-builtin, non-
shared_ptr value types, it sidesteps some messy lifetime issues, but that’s not relevant here). If we’d have to write our converter, we might as well put that effort into pybind11 wrappers instead. But I’d be all for having our object duck like
astropy.io.fits.Header, modulo my next point.
I don’t know if
astropy.io.fits.Header's focus on being an exact representation of FITS has led to it picking up FITS weirdnesses (e.g. is there special handling for certain strings like
COMMENT?). If so, we may not want that, and we’d instead aim for something that is similar enough to the FITS header data model to round-trip through FITS while still maybe being more appropriate for other file formats.
If we’re extracting information from header metadata into C++ objects, there should be much less need for metadata manipulation except on I/O, hopefully reducing the need for and cost of deep copies.
I didn’t see any special-casing of headers in the
astropy.io.fits.Header documentation. I agree that genericization would be good, but not at the cost of gratuitous incompatibility. Maybe duck-typing is close enough.
I think we are getting off topic a little bit. Whilst a rewrite of
PropertyList seems like something we want, my motivation for now is to allow
obs_metadata to work on FITS-header representations that look like Python dicts. Making
PropertyList work like a dict will allow me to do the header translations in a generic way supporting Astropy like objects, AST FitsChan and PropertyList. This seems like a win to me rather than using the PropertyList native API.
Can we then please focus on the question I asked above in the behavior of
__getitem__() and other edge cases. I have a fully working implementation on the ticket I mention above, but I’m concerned that the behavior might not be what people want. I wonder if @rowen’s objection to weirdness with
get() would be assuaged if PropertySet/List knew if it was a scalar or vector item (so you got consistent answer and
add() fails for a scalar). I’m not sure that my C++ skills are good enough to add support for that in a reasonable time scale.
I think you didn’t get an answer to the original question is because I don’t think there is a good answer to the question within those bounds - the reason RFC-434 and the original APIs for those classes went they way they did is because the behavior of
PropertyList is so different from normal
dict behavior that there is no
__getitem__ that would be non-surprising.
That said, I do think:
@price’s smaller set of Pythonic additions (and possibly more, which I imagine you’ve done) are not at all problematic;
- I think I agree that making it so the container knows when it has a scalar or a vector would probably make it possible to define a
__getitem__ that makes sense.
Another possibility might be to add a Pythonic interface to
PropertyList as a separate class (i.e. a shim layer) that you could use in the metadata extraction code, in contexts where you can reject difficult cases (i.e. can’t construct a shim with vector-valued items or nesting, because you know you don’t need them and they’re what make the Pythonic interface hard).
To avoid confusion for the initial implementation, after talking to @jbosch and @rowen , I have removed the
__getitem__ support from the initial proposal. We will add it back once daf_base can distinguish between scalars and arrays. In the mean time
__contains__ will work once I merge DM-15676. (waiting on Jenkins).