Memory management in the new butler

I’m still missing something. A task doesn’t “want” an output dataset persisted; a task user wants the output dataset persisted. The Butler doesn’t decide whether to persist a dataset or not; the Butler is configured by a user to persist the dataset or not. The task always uses butler.put(); where that goes to (it could be zero, one, or multiple destinations) is determined by the user. If a downstream task is not going to read the data, the configuration would not specify storing the data in memory.
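A minimal sketch of that division of responsibility, with entirely hypothetical configuration keys, just to make the point concrete:

```python
class ExampleTask:
    def run(self, butler, dataId, exposure):
        # ... processing of `exposure` elided ...
        butler.put(exposure, "calexp", dataId)  # the task always puts; it never chooses destinations

# The destinations come from user-supplied repository configuration, not task code:
#   outputs: []                               -> don't persist at all
#   outputs: [posix:/datasets/repo]           -> persist to disk
#   outputs: [posix:/datasets/repo, memory]   -> disk plus an in-memory hand-off
#                                                for a downstream task that will read it
```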

The “user” in this case could be a human directly configuring and invoking the task and its Butler or a higher-level construct invoking the task (but ultimately under the control of some human-specified configuration).

It would depend on the situation. There are some operations (the actual scatter/gather operations we currently do with ctrl_pool: global astrometric solution, focus determination, background matching) for which I would put actual data in an RDD because the data is very temporary, only useful for the gathered processing, and it would be very inefficient to bounce it off the disk. For everything else (e.g., warp --> stack --> multiband), we want the results persisted to disk anyway, so a dataRef or similar is fine.

This is strongly connected with the evolving SuperTask design. I would really like it if we could defer this discussion until later in the week. I will not be able to do real-time posting until then. David C and I need to give our full attention to Zeljko and Mario while we’re here.

Caches and buffers need to be thought about carefully. Are we assigning all caching responsibilities to our own code and ignoring operating-system file-system caches? We’ve gotten quite a bit of help in certain execution environments from systems people.

Here is another use case to consider: generating index files for astrometry reference catalogs. @KSK’s new design involves reading one or more reference catalogs and creating a set of “sharded” files containing objects found within a particular range of HTM IDs. However, the input catalogs can be huge.

If the butler simply provides file handles that the task can read from and write to then it is easy to see how to avoid running out of memory: the task can process one object or subset/chunk of objects at a time and avoid accumulating much in memory. However, if the butler is going to do more than provide file handles then I wonder how memory will be managed.
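A minimal sketch of that chunked style, assuming the butler (or a prep script) hands us plain file handles and that some htm_id_for(ra, dec) helper exists; all names here are hypothetical:

```python
import csv

CHUNK = 100_000  # rows buffered in memory before flushing to shard files

def shard_reference_catalog(read_handle, open_shard, htm_id_for):
    """Stream a huge reference catalog into per-HTM-ID shard files."""
    reader = csv.DictReader(read_handle)
    buffers = {}   # htm_id -> rows not yet flushed
    pending = 0
    for row in reader:
        hid = htm_id_for(float(row["ra"]), float(row["dec"]))
        buffers.setdefault(hid, []).append(row)
        pending += 1
        if pending >= CHUNK:
            flush(buffers, open_shard)
            pending = 0
    flush(buffers, open_shard)

def flush(buffers, open_shard):
    # open_shard(hid) is assumed to return an append-mode, file-like handle
    for hid, rows in buffers.items():
        with open_shard(hid) as out:
            csv.DictWriter(out, fieldnames=rows[0].keys()).writerows(rows)
    buffers.clear()
```

Memory use is then bounded by the chunk size rather than by the size of the input catalog.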

I’m not sure that the Butler needs to be used to access the input catalogs in this particular case. This seems like a “prep” utility that need not be executed repetitively during production.

I was wondering if you might find time to address these questions now.

Also, a new question has recently come up: what happens if a task gets data from the butler and modifies it without putting it afterwards (a very common thing for a task to do, since making a copy can be expensive), and a later task then gets the same data from the butler? Does the later task get the modified data (because it has been cached) or the persisted data? I hope the later task receives the original, persisted data, because the later task might not even know that the data was updated in memory, which could lead to subtle bugs.
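A toy illustration of the worry (this is not the real Butler, just a naive cache):

```python
import copy

class ToyButler:
    def __init__(self, storage):
        self._storage = storage        # stand-in for persistent storage
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = copy.deepcopy(self._storage[key])
        return self._cache[key]        # shared, cached object

butler = ToyButler({"background": [1.0, 2.0, 3.0]})

bkg = butler.get("background")
bkg[0] = 99.0                          # task A modifies in place, never puts

bkg_again = butler.get("background")   # task B asks for the "same" dataset
print(bkg_again[0])                    # 99.0: task B silently sees task A's change
```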

Please let us know what API you would prefer: one for the first task to inform the Butler that it is changing a returned object, or one for the second task to inform the Butler that it needs a pristine copy rather than a shared one.

If every task that modifies data has to explicitly tell the butler it has done so, then I think that will be a disaster, because the notification will inevitably be forgotten. So I hope that is not under consideration. (If the butler could automatically tell that an object had been modified, that would be nice, but I doubt there is a safe way to make that happen.)

So… I think that if we want to support getting both kinds of data (cached, possibly-modified and pristine, as-persisted), then the task will have to tell the butler which it wants. That suggests some kind of flag. I would think the only safe default is “pristine”; if you are willing to use cached data that some task may have modified, then you should probably have to say so.
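One possible shape for such a flag (the keyword name is purely hypothetical), with the safe default being the pristine, as-persisted object:

```python
calexp = butler.get("calexp", dataId)                          # pristine by default
calexp_shared = butler.get("calexp", dataId, pristine=False)   # explicit opt-in to the
                                                               # possibly-modified cache
```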

If updating in-place is intended as a mechanism for inter-task communication, I’m worried.

I was actually thinking exactly the opposite.

Every task will always need a pristine copy of its input datasets; getting anything else would be a provenance nightmare. Moreover, a task can’t know if previous tasks have modified a cached version in-place, so we’d simply end up always loading everything from disk instead of using the cache. It’s the task that modifies a dataset that knows it has done so, so it has to be the one responsible for indicating this.

Rather than a callback to indicate that it has modified the dataset, however, I think what we want is a variant of Butler.get that tells the butler it will modify the dataset and hence it should not be cached. I haven’t thought through all of the implications of this, so I’m not at all certain, but I haven’t yet seen any critical flaws in this approach. The problem I see with this is that it means we’ll still reload things from disk, instead of copying them in memory.

So maybe we need both a way to indicate that a dataset will be modified and a way to indicate that the same dataset will be needed later, with the task framework somehow taking all of that into account and informing the butler whether something should be loaded and copied vs. loaded and not cached.
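Something along these lines, with both hints spelled out (all keyword names hypothetical); the task framework would translate them into “load and copy” versus “load but don’t cache” decisions when it calls the butler:

```python
exposure = butler.get("calexp", dataId, will_modify=True)     # don't hand out a shared cached copy
coadd = butler.get("deepCoadd", dataId, needed_later=True)    # worth keeping (a copy) in the cache
```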

In any case, it certainly suggests that any time we can make our objects immutable and force dataset updates to be atomic, it’s a win. The complexity in this problem makes me wonder if we also want to implement some sort of freezing and manual copy-on-write in things like images and catalogs, which we almost certainly can’t just make immutable.
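A toy sketch of what freezing plus manual copy-on-write could look like for a pure-Python container (the real images and catalogs would need equivalent support in C++ and in their NumPy views):

```python
import copy

class Freezable:
    def __init__(self):
        self._frozen = False

    def freeze(self):
        self._frozen = True

    def thawed_copy(self):
        new = copy.deepcopy(self)
        new._frozen = False
        return new

    def _assert_mutable(self):
        if self._frozen:
            raise RuntimeError("object is frozen; use thawed_copy() to modify")

class Catalog(Freezable):
    def __init__(self, rows):
        super().__init__()
        self._rows = list(rows)

    def append(self, row):
        self._assert_mutable()
        self._rows.append(row)

cat = Catalog([{"id": 1}])
cat.freeze()
working = cat.thawed_copy()    # the manual copy-on-write step
working.append({"id": 2})      # fine
# cat.append({"id": 3})        # would raise RuntimeError
```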

I have to agree with Jim here. From a workflow-management perspective, the ability to modify data, either in memory or on disk, is scary and will break lots of assumptions that are being made. Code is no longer re-executable (which may be needed in the case of a crash). Yes, the data has to be read from disk again (hopefully it is still in the OS cache), but this prevents any code from modifying the raw data that other processes assume is still pristine.

Most of the workflow systems that I have seen (and written) copy the data before the code is executed, to prevent any process from modifying it. That being said, one option is to have the request to the butler be for either a read-only (R) copy or a read/write (RW) copy. If this option is chosen, we will need to add tests that check that data checked out as R is not modified.
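A sketch of the kind of check that could catch it, assuming a workflow hook that runs around each step; the pickle-based digest is only for illustration, and real image/catalog objects would need a proper serialization:

```python
import hashlib
import pickle

def digest(obj):
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

def run_step_with_readonly_check(step, readonly_inputs):
    """Run `step(**readonly_inputs)` and fail if any R input was modified."""
    before = {name: digest(obj) for name, obj in readonly_inputs.items()}
    result = step(**readonly_inputs)
    for name, obj in readonly_inputs.items():
        if digest(obj) != before[name]:
            raise RuntimeError(
                f"read-only input {name!r} was modified by {step.__name__}")
    return result
```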

I would like the default to always be read-only.

Just to be clear, we’re not talking about modifying datasets in persistent storage. Objects retrieved from those datasets might be modified in place in memory, but they’d be persisted as different datasets. The difficulty comes if they’re shared as components of other objects in memory or reused in other tasks.

I’m thinking that as soon as the object is modified, it must be dissociated from its original dataset type and data identifier; it is no longer the same thing. The question is whether (and how) to retain a copy of the original, unmodified object in memory to satisfy future customers of that dataset type and data identifier. Since the modifying task can’t know what will be run afterwards, this seems somewhat difficult.

Images/exposures and tables are my main concern. The code that gets an exposure from the butler may not know if that exposure is going to be modified by some subtask in the tree. The only safe assumption is that it will be modified (and indeed exposures quite often are).

If we can come up with a system for enforcing that these cannot be modified, or detecting that they have been modified, then it seems fairly simple to mark the cache as dirty or otherwise disassociate the data from its dataset.

I agree that this may not be the case right now, but I think that whether an argument is modified in place is conceptually an important part of all of those subtask signatures (even if it’s not obvious in Python code), and that means the parent task ought to be able to know it. I think this is something we need to start paying more attention to.

I’m skeptical that we can rely on documentation to enforce const. That has not worked to date, and we are well into the project. It is also very fragile. Suppose we have a top-level task A that calls B and C, neither of which modifies its exposure. Now a user retargets C with Cprime, which does modify the exposure. Suddenly we have undefined behavior. This makes me really nervous. I think it would be safer if the system assumed the worst, and you had to go out of your way to get any other behavior.

Tasks are going to have to start declaring more of their behavior (including use of dataset types or use of multiple threads, for example). Declaring that they modify an input is not unreasonable. The declaration would use a mechanism that is visible without even instantiating the Task.
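For example (attribute names hypothetical), a class-level declaration is inspectable without constructing the Task, so a framework or a validation script can check it up front:

```python
class SubtractBackgroundTask:
    # Declared behaviour, readable as SubtractBackgroundTask.modifies_inputs
    # without instantiating the task.
    modifies_inputs = ("exposure",)   # this task updates its input exposure in place
    uses_threads = False

    def run(self, exposure):
        ...  # subtract the background from `exposure` in place
```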


Is there a reasonable way to make all cached objects immutable? For example, would it work, and be acceptable, to monkey-patch the __setattr__ method to raise an exception if called?
At the same time, maybe we could add a method to the object that returns a copy the script is free to mutate (provided that the object is copyable).
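A rough sketch of that idea, and of where it runs into trouble: __setattr__ is only honoured when it lives on the type, so the usual trick is to swap the instance onto a dynamically created frozen subclass, which already fails for many C++-backed objects:

```python
import copy

def _refuse(self, name, value):
    raise AttributeError("cached object is read-only; ask for a mutable copy")

def freeze_attributes(obj):
    """Swap in a subclass whose __setattr__ raises (pure-Python objects only)."""
    frozen_cls = type("Frozen" + type(obj).__name__, (type(obj),),
                      {"__setattr__": _refuse})
    obj.__class__ = frozen_cls
    return obj

def mutable_copy(frozen_obj):
    """Return a deep copy restored to the original, mutable class."""
    new = copy.deepcopy(frozen_obj)
    new.__class__ = type(frozen_obj).__bases__[0]
    return new
```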

Monkey-patching __setattr__ wouldn’t work, and I’m skeptical in general, but it’s not entirely out of the question. The really big concern here is images (tables too, but it’s essentially the same problem for those). It’s fairly critical that we provide views (via NumPy) into those objects, which makes it essentially impossible to intercept modifications, unless we can sometimes provide read-only views and sometimes provide read-write views (which NumPy does support). That’s entirely natural in C++, since “constness” is a fundamental part of the language and there’s generally no extra work to define a const version of a class. But it’s much less natural in Python, where the pattern sometimes exists (as with set and frozenset) but it’s a lot of work to provide both mutable and immutable versions of everything.
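For reference, this is what read-only versus read-write NumPy views look like on the same buffer; the open question is whether our image and table classes can hand out the read-only flavour when the underlying object is supposed to be immutable:

```python
import numpy as np

pixels = np.zeros((4, 4), dtype=np.float32)

ro_view = pixels.view()
ro_view.flags.writeable = False    # read-only view onto the same buffer

rw_view = pixels.view()            # ordinary read-write view

rw_view[0, 0] = 1.0                # fine: modifies the shared buffer
try:
    ro_view[0, 0] = 2.0            # refused: the view is read-only
except ValueError as exc:
    print(exc)                     # "assignment destination is read-only"
```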

I know something like this is intended for SuperTask, but I was not aware we would be adding it to all tasks. It may help to have a programmatic way of querying this info, but without enforcement I’m still concerned.