Topics for Butler conversations at the AHM

A few of us have been discussing Butler stuff recently and have been putting off some conversations until the all-hands meeting, hoping that we can find a moment to have an ad hoc face-to-face conversation and whiteboard things as needed.

To keep track of these topics, I created a list at https://confluence.lsstcorp.org/display/DM/Butler+Topics+for+AHM+2016
If you’re interested in any of these in particular, add your name under the topic and I’ll include you when we work on finding a time to meet.

If you’d like to add a topic, I think it will work if you just add it to the page (don’t forget to include your name under the topic).

I would like to understand the mechanisms currently provided by the Butler for using non-POSIX storage backends. Specifically, I’m interested in exploring the suitability of object stores (such as OpenStack Swift or Amazon S3-compatible) as repositories for LSST data.

I wonder if this topic will be addressed in one of the breakout sessions. If that’s the case, I would like to attend.

@FabioHernandez let’s make time to discuss it. I don’t think there will be dedicated sessions set up for these topics (@gpdf?), but we could use some of the time during the pairwise discussions, and/or meet informally any time during the week. If anyone else is interested (let me know), we can figure out a way to set up a more formal time.

I do remember the conversations we had re. using S3 servers at last year’s meeting in Bremerton.

There’s not any mechanism other than POSIX yet, but I’ve been thinking about it and talking about it a lot with Fritz & KT. It might be a helpful primer for you to read through the document at https://confluence.lsstcorp.org/display/DM/Butler+Storage+and+Format+Refactor

Hi there. I have a little bit of experience from a previous project operating S3-compatible storage at the near petabyte scale (specifically, Ceph’s radosgw) as a data repository and would be glad to share knowledge/experience if that is helpful.

Just to add to the non-POSIX conversation, there is a person here at UW trying to run the command line tasks on Spark and she ended up un-butling the tasks. I would be interested in knowing whether a backend suitable for Spark is even a reasonable thing to ask for. Note that I know next to nothing about how processing in Spark works.

Hey Maria, I think it would be very good to hear about that. Will you be at the meeting all week?

I’m only Wikipedia-familiar with Spark. I’d like to discuss this more. It feels a little funny, like an abstraction layer on top of an abstraction layer(?). But I guess using it to abstract a distributed dataset across multiple machines could be useful? Are you and/or that person available at the meeting this week? (I suppose you’ll be busy with the review through Wednesday or Thursday…)

My understanding is that Spark is intended to behave as if it is working on in-memory data, with its own data distribution and task assignment. In that case, there seem to me to be two alternatives: either have an essentially trivial Butler layer that just retrieves whatever data Spark already has, or change the model and use the Butler only to load data into Spark to begin with.

Thanks for pointing me to the document. I will get familiar with it for next week’s conversation.

Third option: use Spark as a parallelisation engine only. In that case, Spark would distribute dataIds and temporary products, and you would use the Butler in Tasks for I/O the same way we always have — no need to change anything except the glue in-between the top-level CmdLineTasks.
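A minimal sketch of this "parallelisation engine only" pattern, to make the division of labour concrete. Everything in it is a stand-in: `ToyButler`, `process`, and `run_all` are hypothetical names, the dataset types "raw" and "result" are invented, and a standard-library thread pool plays the role that Spark would play. The point is only that the engine sees dataIds, while each task does its own I/O through the butler.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyButler:
    """Hypothetical stand-in for a Butler: maps (dataset type, dataId) to data."""

    def __init__(self, store):
        self._store = store

    @staticmethod
    def _key(dataset_type, data_id):
        # dataIds are dicts; turn them into something hashable.
        return (dataset_type, tuple(sorted(data_id.items())))

    def get(self, dataset_type, data_id):
        return self._store[self._key(dataset_type, data_id)]

    def put(self, obj, dataset_type, data_id):
        self._store[self._key(dataset_type, data_id)] = obj


def process(butler, data_id):
    """Toy task: all I/O goes through the butler, nothing through the engine."""
    raw = butler.get("raw", data_id)
    butler.put(sum(raw), "result", data_id)
    return data_id


def run_all(butler, data_ids, workers=2):
    # The engine only distributes dataIds. With PySpark this would be roughly
    # sc.parallelize(data_ids).map(...).collect(); that variant is not run here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: process(butler, d), data_ids))
```

Usage would look like `run_all(butler, [{"visit": 1}, {"visit": 2}])`. Note that in this mode intermediate products never live in the engine's memory, which is where the concern about losing much of Spark's benefit comes from.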

Yes, will be there all week.

Indeed that’s an option and was how I initially thought about it. It’s not clear to me how much we can do with temporary products in Spark without modifying the current CmdLineTasks, and I worry that much of the benefit of using Spark might go away in this mode.

This was one of my concerns as well.

Another thing that came up in this space is whether we should try harder to allow construction of objects from streams. Currently we don’t allow construction from byte streams for any of our low-level objects. I don’t know if that was a conscious decision or just an artifact of the capabilities of cfitsio.

This is mostly a cfitsio issue. I think it will most likely go away (I can think of a couple of different ways to make it do so, at varying efficiencies) when we allow back-ends like S3.
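To illustrate what "construction from a byte stream" could mean in practice, here is a small sketch. It does not use cfitsio or FITS at all: `ToyImage` and its binary layout (two little-endian uint32 values followed by float32 pixels) are invented for the example. The idea is that a constructor taking any binary file-like object works equally well on a local file and on bytes fetched from an object store such as S3 or Swift.

```python
import io
import struct


class ToyImage:
    """Hypothetical low-level object; the binary layout here is invented."""

    def __init__(self, width, height, pixels):
        self.width = width
        self.height = height
        self.pixels = pixels

    def to_bytes(self):
        # Header: width and height as little-endian uint32, then float32 pixels.
        header = struct.pack("<II", self.width, self.height)
        return header + struct.pack(f"<{len(self.pixels)}f", *self.pixels)

    @classmethod
    def from_stream(cls, stream):
        """Build from any binary file-like object (open file, BytesIO, ...)."""
        width, height = struct.unpack("<II", stream.read(8))
        n = width * height
        pixels = list(struct.unpack(f"<{n}f", stream.read(4 * n)))
        return cls(width, height, pixels)


# Bytes that might have come from an object store rather than a POSIX file.
blob = ToyImage(2, 1, [1.0, 2.5]).to_bytes()
image = ToyImage.from_stream(io.BytesIO(blob))
```

The same `from_stream` call accepts `open(path, "rb")`, so the object itself never needs to know which kind of back-end the bytes came from.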

Hey Maria, are you available to meet today after the last session, at 5 or a little after? Tomorrow after 11 would also work for me (my shuttle to the airport leaves around 4).

FYI @FabioHernandez and I are planning to test the Butler using a repository in Swift (i.e., not on the local filesystem). The working plan is on Confluence.