Memory management in the new butler

I’m curious about the plans for memory management in the new butler, and in particular about when objects will be cached and not cached. Apparently the preferred way for one SuperTask to call another (where the second is a sub-task) is for the first to use butler.put for its outputs, and the second to use butler.get instead of passing the data directly. Such data must be cached, both for performance and because it must work even if the first outputs are not persisted. However, overly aggressive caching will lead to memory problems. For example a task that produces coadd temporary exposures may write many exposures with one call to run_quantum. If these are cached, we will quickly run out of memory. Another example is the coaddition task, which reads many exposures that must not be cached.

In case this issue has not been thoroughly thought through already, I present some ideas…

A very simple solution is to have the butler never cache data, and instead pass the data directly, just like calling any other sub-task. This requires the parent task to pass the input data, and also pass a butler and data IDs so the sub-task can persist its outputs. It makes the code slightly messier, but it quite direct and obvious. It also preserves the paradigm that there is only one way to call a sub-task. For a pre-supertask example see the new ProcessCcdTask, which calls several command-line tasks.

Another solution is to have the butler only cache data products that are written by one SuperTask and read by a later one in the same run. The butler will have to be careful to identify those products correctly.

I prefer the simpler solution. It may lose one of the advantages of the new SuperTask architecture, but from my perspective it retains a primary advantage: SuperTasks register their inputs and outputs with the butler, providing a nice way to see which products are read and written by which tasks.

I’m interested to hear what @ktl has to say on the subject. He certainly knows more about the topic than me, but as to how this could be done with Butler I have more questions than answers. I think there may be ways, but I need to understand the constrains before suggesting ideas. It would help me to do some whiteboarding with someone.

From the hip chat conversation it sounds like @gpdf would also like to participate, but can’t do so until later this week.

The initial thinking was that the pipeline user could determine via configuration whether a given output dataset type was written to persistent storage, memory, or both (or none). (This is configuration in a general sense; I’m not sure yet if its best in a repository config or in a mapper policy.) The pipeline should work with any of those configurations except “none”, of course with different performance implications. The Butler would not itself try to determine whether a given output dataset type should be cached in memory.

If that’s not workable, please let us know.

I’ve been thinking about this from a slightly different direction because of my interest in Apache Spark and similar execution environments. If the pipeline is expressed as a DAG, then the pipeline can know when products need to be re-used, and hence need to be cached.

That doesn’t help with the smaller scale problem of subtasks within a task, but I don’t see how you solve that without devolving to some sort of clipboard configuration or a lot of introspection.

@price, would you use an RDD to hold actual data or just dataIds/dataRefs?

(Might be out of scope for this thread; feel free to split if desired.)

The issue I’m concerned about is using the butler to as a “clipboard” to pass data between sub-tasks that happen to be super tasks. That raises caching issues that are go beyond whether a particular task wants a particular output dataset persisted. If the butler is automatically used as a clipboard then a task will always use butler.put on all outputs, whether they are to be persisted or not, but the butler has to keep track of data (after deciding whether or not to persist it) only if another task is going to read it. Or, as I said, we can punt on the issue and just have the calling task pass the data directly to the sub-task.

I’m still missing something. A task doesn’t “want” an output dataset persisted; a task user wants the output dataset persisted. The Butler doesn’t decide whether to persist a dataset or not; the Butler is configured by a user to persist the dataset or not. The task always uses butler.put(); where that goes to (it could be zero, one, or multiple destinations) is determined by the user. If a downstream task is not going to read the data, the configuration would not specify storing the data in memory.

The “user” in this case could be a human directly configuring and invoking the task and its Butler or a higher-level construct invoking the task (but ultimately under the control of some human-specified configuration).

It would depend on the situation. There are some operations (the actual scatter/gather operations we currently do with ctrl_pool: global astrometric solution, focus determination, background matching) for which I would put actual data in an RDD because the data is very temporary, only useful for the gathered processing, and it would be very inefficient to bounce it off the disk. For everything else (e.g., warp --> stack --> multiband), we want the results persisted to disk anyway, so a dataRef or similar is fine.

This is strongly connected with the evolving SuperTask design. I would really like it if we could defer this discussion until later in the week. I will not be able to do real-time posting until then. David C and I need to give our full attention to Zeljko and Mario while we’re here.

Caches and buffers need to be though about carefully. Are we assigning all cacheing responsibilities to our own code and ignoring operating system file system caches? we’ve gotten quite a it of help in certain execution environments from system people.

Here is another use case to consider: generating index files for astrometry reference catalogs. @KSK’s new design involves reading one or more reference catalogs and creating a set of “sharded” files containing objects found within a particular range of HTM IDs. However, the input catalogs can be huge.

If the butler simply provides file handles that the task can read from and write to then it is easy to see how to avoid running out of memory: the task can process one object or subset/chunk of objects at a time and avoid accumulating much in memory. However, if the butler is going to do more than provide file handles then I wonder how memory will be managed.

I’m not sure that the Butler needs to be used to access the input catalogs in this particular case. This seems like a “prep” utility that need not be executed repetitively during production.

I was wondering if you might find time to address these questions now.

Also, a new question has recently come up: what happens if a task gets data from the butler and modifies it without putting it afterwards (this is a very common thing for a task to do, since making a copy can be expensive), and then suppose a later task gets the same data from the butler? Does the later task get the modified data (because it has been cached) or the persisted data? I hope the later task receives the original, persisted data, because the later task might not even know that the data was updated in memory, which could lead to subtle bugs.

Please let us know what API you would prefer for either the first task to inform the Butler that it is changing a returned object or the second task to inform the Butler that it needs a pristine copy and not a shared one.

If every task that modifies data has to explicitly tell the butler it has done so, then I think that will be a disaster, due to forgetting to supply the notification. So I think hope that is not under consideration. (If the butler could automatically tell that an object had been modified, that would be nice, but I doubt there is a safe way to make that happen).

So…I think if we want to support getting both kinds of data: cached, possibly-modified data or pristine as-persisted, then the task will have to tell the butler what it wants. That suggests some kind of flag. I would think the only safe default is “pristine”; if you are willing to use cached data that some task may have modified then you should probably have to say so.

If updating in-place is intended as a mechanism for inter-task communication, I’m worried.

I was actually thinking exactly the opposite.

Every task will always need a pristine copy of its input datasets; getting anything else would be a provenance nightmare. Moreover, a task can’t know if previous tasks have modified a cached version in-place, so we’d simply end up always loading everything from disk instead of using the cache. It’s the task that modifies a dataset that knows it has done so, so it has to be the one responsible for indicating this.

Rather than a callback to indicate that it has modified the dataset, however, I think what we want is a variant of Butler.get that tells the butler it will modify the dataset and hence it should not be cached. I haven’t thought through all of the implications of this, so I’m not at all certain, but I haven’t yet seen any critical flaws in this approach. The problem I see with this is that it means we’ll still reload things from disk, instead of copying them in memory.

So maybe we need both a way to indicate that a dataset will be modified and a way to indicate that the same dataset will be needed later, with the task framework somehow taking all of that into account and informing the butler whether something should be loaded and copied vs. loaded and not cached.

In any case, it certainly suggests that any time we can make our objects immutable and force dataset updates to be atomic, it’s a win. The complexity in this problem makes me wonder if we also want to implement some sort of freezing and manual copy-on-write in things like images and catalogs, which we almost certainly can’t just make immutable.

I have to agree with Jim here. From a workflow management perspective, the ability to modify data, either in memory or on disk is scary and will break lots of assumptions that are being made. Code no longer is re-executable (which might happen in case of a crash). Yes the data has be read from disk again (hopefully it is still in the OS cache), but this will prevent any code from modifying the raw data that other processes will assume to be still pristine.

Most of the workflow systems that I have seen (and written) will copy the data before the code is executed, to prevent any process from modifying the data. That being said, one option is to have the request to butler be for either a Read-only ® copy or a Read/Write (RW) copy. If this option is chosen we will need to add tests that can check to see if data that is checked out as R will not be modified.

I would like the default is always read-only.

Just to be clear, we’re not talking about modifying datasets in persistent storage. Objects retrieved from those datasets might be modified in place in memory, but they’d be persisted as different datasets. The difficulty comes if they’re shared as components of other objects in memory or reused in other tasks.

I’m thinking that as soon as the object is modified, it must be dissociated from its original dataset type and data identifier; it is no longer the same thing. The question is whether (and how) to retain a copy of the original, unmodified object in memory to satisfy future customers of that dataset type and data identifier. Since the modifying task can’t know what will be run afterwards, this seems somewhat difficult.

Images/exposures and tables are my main concern. The code that gets an exposure from the butler may not know if that exposure is going to be modified by some subtask in the tree. The only safe assumption is that it will be modified (and indeed exposures quite often are).

If we can come up with a system for enforcing that these cannot be modified, or detecting that they have been modified, then it seems fairly simple to mark the cache as dirty or otherwise disassociate the data from its dataset.