Shared stacks and debugging use-cases

I’ve seen a number of mutterings on HipChat about how badly broken the shared stack is on lsst-dev; some excerpts:

@ktl:

The three big problems on lsst-dev are: 1) shared stacks are too big and slow, 2) eups tags (besides bNNNN) not automatically applied to ~lsstsw versions, 3) interesting tags not automatically copied to a usable smaller stack.

@jdswinbank (on the unavailability of IPython on lsst-dev):

Not sure whether I’m more shocked that our core development system is so broken or that, apparently, nobody noticed.

I’ve also seen complaints from @mwv, @rowen, and @merlin that I can’t be bothered to find just now. It has also been apparent, while validating DM-4692, that not having a usable shared stack on a single development cluster is a real impediment to progress (though this goes well beyond just DM-4692; it’s only the most recent reminder).

Well, actually, what’s been a problem is that we don’t have a solution for this situation:

  • Developer A builds a stack containing some branches he/she is working on, runs some test data through that stack, generating an output repository somewhere, and discovers a problem that Developer B might be able to help with, so…
  • Developer B needs to look at the output repository, re-run the same pipelines with some additional changes to code or configuration, and be able to plot and display the results.

It’s crucial that Developer B not have to transfer any data between machines or compile any code beyond what they need to override relative to Developer A’s environment. Frequently Developer B is busy and should be spending most of their time on something else, but with easy access to Developer A’s environment, they might be able to unblock Developer A quickly.

The solution we’ve used on the HSC side (and in the now-somewhat-distant past on shared LSST machines) is to have a shared stack on a single beefy machine, like the one we’ve tried to put together on lsst-dev. This has worked quite well for us, after some initial overhead and training: everybody has to be vigilant about group access via chmod and suid, know their way around EUPS tags, and set up some way to do remote display via (probably) IPython and ds9 (a rough sketch of these conventions follows the list below). I think the key differences between the situation on tiger and the situation on lsst-dev are:

  • @price devotes a lot of time to serving as HSC release master, vetting, publishing, and installing new stacks. That lets us make real releases with meaningful (for developers, not just managers) version numbers on timescales determined to be useful by a human (often once every few months, but sometimes as often as once a week).
  • We use EUPS versions (with umbrella packages), not tags, to designate releases - so the number of active tags in the shared stack is extremely low.
  • We rigorously control more of our third-party packages via EUPS, as LSST used to do.
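
For concreteness, here is a rough sketch of what those conventions look like in practice; the paths, group name, and package/version strings are illustrative, not the actual tiger or lsst-dev configuration:

```bash
# Keep the shared install group-writable: setgid directories so new files
# inherit the group, plus a permissive umask for everyone using the stack.
sudo chgrp -R lsst /opt/shared-stack
sudo find /opt/shared-stack -type d -exec chmod g+ws {} +
umask 002

# Use the shared stack (the setups.sh location is an assumption).
source /opt/shared-stack/eups/bin/setups.sh

# Designate releases with an umbrella-package *version* rather than piling
# up tags: one declare per release keeps the number of active tags low.
eups declare hscPipe 4.0.0 -r /opt/shared-stack/Linux64/hscPipe/4.0.0
setup hscPipe 4.0.0    # developers set up by version, not by tag
```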

I’m not sure all of these are important to the success of this development model (especially the last point), but I think we’d need to take some of these steps if we want to make lsst-dev similarly usable. And I think we really need to do that. Or…

…we need to find some other solution for the workflow described above. I know we have a lot of people more excited about Nebula than about shared stack management and EUPS reimplementation, and I got the impression at one point that it might also provide a solution to this problem. If that’s the case, I’d love to hear more about how it might work. Either way, we need to get everyone through the overhead/training process for whichever approach we adopt, so that the next time Developer A gets stuck, he/she hasn’t already wasted effort by working in the wrong space.


The interesting thing about the use of Nebula is that, for the example you provided, each developer effectively masquerades as the same user (centos or cloud-user) but within a different sandbox (VM). So if developer A invites developer B into their sandbox, it is as easy as copying in a public key (no need for group permission tracking, etc.). The complication is in what it may mean to have a user with multiple versions of the stack in one sandbox.
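
For what it’s worth, the “copying in a public key” step is about as small as it sounds; a minimal sketch, where developer B’s key and the instance address are placeholders:

```bash
# On developer A's instance, logged in as the shared 'centos' user
# (per the post above).
echo "ssh-rsa AAAA... devB@laptop" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Developer B can then log straight into the same sandbox:
ssh centos@<instance-address>
```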

I am very interested in what it means to have a stack that is ‘too big and slow.’ This sounds like a stack-specific issue that should be tackled before moving to new environments.

Right, so it would be really ideal to move away from the “party machine” stack. At the most basic level, we need a workflow model that allows interaction with developers who don’t have NCSA credentials.

Like Jason said, the right way of sharing when troubleshooting is to let someone into your private instance.

As we speak I am testing the creation of OpenStack images and containers based on the weekly. We’re still (actively) working on that service and its documentation but you can get an idea from

http://sqr-002.lsst.io/en/latest/

It’s really very easy and I think it’s a more generalised solution than a party account on a single machine.
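
As a rough illustration only (the real image names and procedure are in SQR-002; the names below are placeholders), launching a sandbox from such an image would look something like:

```bash
# Image, flavor, and key names are placeholders.
openstack server create \
    --image lsst-weekly-w_2016_XX \
    --flavor m1.medium \
    --key-name my-key \
    my-stack-sandbox

# Then log in and work as usual:
ssh centos@<floating-ip>
```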

I think access to large datasets from Nebula (whether via NFS or some other means) will be required before this can work well. Having a commonly-accessible place to write results back to may also be useful.


This is something we wish to provide once our purchasing contract is in place. Read-only access to datasets from Nebula is a real possibility. Writing the datasets back directly is not doable; that is asking for problems. I would think that, in general, Nebula is a good home for focused developer work, and I would not expect the results to wind up back in a shareable medium. Instead, I would think that once someone is ready to produce data that is shareable, they could generate it via the upcoming verification cluster, one of the condor installations, or some other TBD mechanism.

I should clarify what I mean by “stack” here: what’s slow is what I’ll call an “EUPS stack” - a particular shared installation of our software containing many versions of our software, for use by multiple users (it’s not that our software itself is too big or slow - that might be true, but it’s not a critical problem right now). The short summary of what’s going wrong is that the EUPS database that manages all those versions doesn’t scale well to large numbers of packages, versions, and especially tags. So while individual per-user EUPS stacks work well, because individuals can keep the number of versions and tags small, our shared stacks quickly bog down as it becomes painfully slow to set up a particular version of the stack.
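
A concrete, if informal, way to see the effect; the stack paths and tag name here are placeholders:

```bash
# Count declared product/version entries in each stack:
EUPS_PATH=/ssd/lsstsw/stack eups list | wc -l    # shared stack: very large
EUPS_PATH=$HOME/lsst/stack  eups list | wc -l    # personal stack: small

# With the shared stack active, setting up a tagged version is where the
# slowness shows up:
time setup lsst_apps -t b1234
```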

Aside from the future problem of dealing with less-trusted users, the advantage of this over a traditional Unix multi-user, group-permissions environment isn’t obvious to me. I can imagine it makes it easier to manage competition for resources and test on multiple OSes, but as useful as those are for CI, I don’t think they’re pressing needs for most developers. And coming from that more traditional environment, letting others have access to my user account even on a throwaway VM seems weird and somehow wrong.

My reading of this is that we will need EUPS shared stacks going forward, and we can’t rely on Nebula to support the workflow at the top of this thread. In fact, it sounds like Nebula might provide a more convenient environment in some ways than working on our own laptops, but really doesn’t meet our collaborative development needs at all: essentially all of our collaborative development involves producing data that needs to be shareable, even if it isn’t initially obvious that a particular run will have to be.


I was implying the opposite. When I think about shared data that is the result of a change I have made as a developer or any other change that would serve as input to another (including QA), then I think in terms of something along the lines of RFC-95 where shared data sits in managed storage, shared but safe-guarded from accidents. I was thinking that the creation of these datasets was not necessarily something you would want to do on a shared resource like lsst-dev else you risk impacting your neighbor. Instead you would take that run to designated resources (condor/batch) and produce the output data there.

Nebula has a strict advantage over laptops (or will soon); it will have read access to the RFC-95 managed data sets if we take advantage of the incoming storage.

I think I agree with all of this, actually (including the parts I didn’t quote). But there are some important pieces missing from this description that earlier comments in this thread made me think were impossible with this model; hopefully that’s wrong.

When a user puts together a custom version of the software (corresponding to a particular issue branch, probably) in a Nebula instance - this may include code that hasn’t even been committed in git, let alone pushed or published - they need to be able to run that in ways that will produce fairly large outputs and require, say, a few 10s of core-hours without having to rebuild their software stack on some other system. I was imagining that these outputs would go into some sort of sandbox space (a few 10s of TB for the whole team, maybe?), in a location that can be read by other users (those are the “rerun” directories of RFC-95). These sandboxes also need to be relatively long-lived (one user may create a rerun that other users run on top of for months). My impression was that write access to a sandbox this large and long-lived from Nebula is not feasible, and I wasn’t sure if access to a larger number of cores is (without packing up your software environment and recreating it elsewhere).
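
To make that concrete, here is a sketch of the kind of rerun-based sharing I have in mind; the repository path, rerun names, data ids, and config override are all placeholders, and the /datasets location stands in for the RFC-95-style managed storage:

```bash
setup -r ~/lsstsw/build/pipe_tasks     # developer A's ticket-branch build

processCcd.py /datasets/hsc/repo \
    --rerun private/devA/DM-XXXX \
    --id visit=1228 ccd=0..8 -j 8

# Developer B reads A's outputs and chains a new rerun on top of them,
# without copying data or rebuilding A's stack:
processCcd.py /datasets/hsc/repo \
    --rerun private/devA/DM-XXXX:private/devB/DM-XXXX \
    --config calibrate.doPhotoCal=False \
    --id visit=1228 ccd=0..8
```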

Clearly, when we run at very large scales, we need to be using versioned, published versions of the software. But there’s a lot of overhead to doing that, and we need to have some ability to scale up processing and sandbox usage before it becomes necessary to package up our software environment and recreate it elsewhere.

Maybe one piece I’m missing is read access to Nebula instance storage from batch cluster nodes, so we could build the software on a Nebula instance but run moderate-size jobs with it on other resources?
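
Purely as a sketch of that idea (whether this is allowed or practical on Nebula is exactly the open question; hostnames and paths are placeholders):

```bash
# On the Nebula instance: export the built stack read-only.
echo "/home/centos/stack batch-*.example.edu(ro)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On a batch node: mount it and set up the same software.
sudo mount -t nfs <instance-address>:/home/centos/stack /mnt/devA-stack
source /mnt/devA-stack/eups/bin/setups.sh
setup lsst_apps
```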


I was hoping that Cinder volumes might provide a mechanism for exchanging intermediates and rerun-style processed data between Nebula instances and developers, but that mechanism (which, as the documentation says, can be thought of as a USB drive that can be attached to one instance at a time) may prove too constraining.
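
For reference, the “USB drive” model looks like this with the standard OpenStack CLI (names and size are placeholders); the one-instance-at-a-time restriction is exactly the constraint in question:

```bash
openstack volume create --size 200 devA-rerun

openstack server add volume devA-instance devA-rerun
# ... developer A writes outputs to the attached volume ...
openstack server remove volume devA-instance devA-rerun

# Only after detaching can developer B attach the same volume:
openstack server add volume devB-instance devA-rerun
```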

How do I interact with github.com as me and maintain my set of commits and branches when I’m in someone else’s sandbox?

How do I eups declare and undeclare stuff without messing up the environment of my host when I’m in their sandbox?

How do I maintain my user workflow customizations (aliases, editor choices, linter, etc.) when I’m in someone else’s sandbox?
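
Standard tooling gives partial, if clunky, answers to the first two questions; this is just a sketch of what’s possible, not established practice (paths and names are placeholders):

```bash
# Git identity: forward your ssh agent instead of copying keys, and set
# author identity per clone so commits are attributed to you.
ssh -A centos@<devA-instance>
git -C ~/work/pipe_tasks config user.name  "Developer B"
git -C ~/work/pipe_tasks config user.email "devb@example.edu"

# EUPS: put a personal, writable stack ahead of the host's on EUPS_PATH so
# your declares and undeclares never touch their database.
mkdir -p ~/devB-stack/ups_db
export EUPS_PATH=~/devB-stack:$EUPS_PATH
eups declare pipe_tasks devB-local -r ~/work/pipe_tasks
setup pipe_tasks devB-local
```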

Right, Cinder volumes are only mounted in a single location. We would need to build on top of that (NFS, etc.) or provide an alternative.

From the developer A / developer B example, I had the impression that this was more of a 15-minute interruption in the life of developer B in one of those “could you come look at this issue” moments. Clearly from the discussions that followed, there is also a need to easily share resultant data without ‘packing up’. So your points are taken.

@frossie
You’ve mentioned that what you’ve been running on lsst-dev is actually a very simple set of 3-5 commands, and you offered to follow up with @jdswinbank with the instructions. I would like to suggest that perhaps you could just post the commands here in this thread for ease of common reference, and then follow up with @jdswinbank.

[Sorry, I meant to post this in this thread.]