Go all in with anaconda?

This is a pre-RFC to test the waters.

The reality is, we have a lot of issues that originate from the wide range of user environments, convolved with the relative brittleness of the stack to third-party changes, convolved with oh god python. Even in the relatively small LSST-land world, they are a constant source of user-support requests. I do not believe that DM shipping third-party packages is the right solution - it’s a “and now we have two problems” situation, and if I had the authority to veto it I would.

Initially, our approach at SQuaRE was to keep the environment as close as possible to distro-provided systems on a small number of OSes, tie into their existing package mechanisms (rpms, homebrew, whatever) for user support, and package running environments as VMs/containers for users who want a slower-moving “off the shelf” experience.

Aside from the difficulty we had decomposing the build (some of the issues have been resolved, not all), the other thing that happened was that @mjuric did an anaconda-based distribution to help out Sims, who are ahead of DM on the curve of user encounter. From the user’s point of view, this is clearly a highly successful solution - it tends to be familiar to users, it plays well with other popular packages in this space, it’s not an in-house solution (danger Will Robinson) - it’s not even an astronomy-ghetto solution - and it presents a single environment irrespective of native OS. Importantly for us, it would also allow us to move the stack build/install process closer to a community standard.

After serious discussion, and in a “least bad choice” spirit, SQuaRE is thinking of RFCing a proposal that we move to anaconda as the only supported environment for distributing the stack.

There are disadvantages/risks. Anaconda is supported as a goodwill/promotional gesture by Continuum Analytics. They don’t have a business model that would allow us preferential treatment as far as I can tell. My own personal could-be-so-so-wrong guess is that anaconda has sufficient legs by now that if CA lost interest, an equivalent community effort would spring up.

Moreover, conda has proven to be a fragile shipping platform on Linux in my preferred mode of letting the dependencies float, and we are being forced to mitigate that with various strategies. The OSX experience has been better, and the reality is that OSX is the dominant user-oriented platform. SQuaRE would not be advocating this solution if the stack were only ever going to run inside a factory, but that’s not the case here, nor do we want it to be. What we know is that we want to be CIing something as close as we can get to a user environment - not just their stack environment, all of their environment - and anaconda is a good solution for that.

This is not an RFC, but a solicitation of discussion that would allow me to construct a good RFC, capturing the arguments on both sides. I am particularly interested in views from Architecture, since besides SQuaRE they are the team which most gets embroiled in user support issues.

I will also say that I don’t feel the current situation is tenable; if we decide not to go down this route (and trust me, I’m torn) I am going to revert to advocating shipping a containerised runnable environment. However, so far the median user seems not to be inclined to go that way, and there are also disadvantages in terms of appealing to people who already have a productive environment that they want to mix and match (eg astropy).

For the users, it seems like the best option; and I want to be dogfooding what I ship to our users.

Interestingly, Simon and I were just discussing this, as it related to trying to build SuiteSparse (which depends on BLAS… which comes “free” with Conda). Having all the “standard” python packages and their dependencies managed by Anaconda (and one can get version records from that, so reproducible pipeline processing is still fine) could simplify several things.
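For what it’s worth, the “version records” workflow mentioned above is already a standard conda pattern; something like the following (file and environment names are illustrative) would capture an environment for reproducible processing and rebuild it later:

```
# Record the exact package set of the active conda environment:
conda list --export > lsst-env-versions.txt

# Later, reproduce an equivalent environment from that record:
conda create --name lsst-repro --file lsst-env-versions.txt
```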

The definite flip side would be “vendor lock-in.” Just a few years ago, everyone was recommending Enthought, but that’s pretty much faded from the numerical computing world, and in much less time than the LSST stack will have to survive. How much of Anaconda is open source, and could it be rebuilt from what is available without too much effort?

Could we “buy in” to Anaconda as a way of ensuring their long-term survival and to get us support?

I confess, my first reaction is not enthusiastic: my experience is that Anaconda itself has been a persistent source of issues (DM-1801, DM-2575, DM-5105, DM-5595, etc). While I’m obviously not on the front line of user support myself, anecdotally it seems to very often be involved when something breaks. My overall confidence in its quality is pretty low. (And it doesn’t help that a regular topic of discussion in the coordination meetings is all the SQuaRE effort that bootstrapping on Anaconda has been absorbing.)

That said, could somebody expand on how and to what extent Anaconda could address versioning issues within the stack? For example, DM-5779 is on my mind at the moment. Would we tie the stack to a particular version of Anaconda? Would we guarantee to always support the latest release? Would we require the latest release (judging by https://gist.github.com/mjuric/1e097f2781bc503954c6, that’s what Mario’s Anaconda packages do)? What happens when Anaconda ships a new NumPy that breaks the stack (DM-4063)?
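On the version-tying question, one option worth noting is that conda supports per-environment pins via a `pinned` file, which could in principle hold back a release known to break the stack; the versions below are purely hypothetical:

```
# <anaconda>/envs/lsst/conda-meta/pinned
# Hypothetical pins: stop `conda update` from pulling in a known-bad NumPy
numpy 1.10.*
python 2.7.*
```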

How would a proposal to rely on Anaconda impact potential AstroPy integration, which makes no such requirement? Would it hurt efforts to make as much of our code as possible pip-installable? (I suspect the answer here is “it wouldn’t”, but it would be nice if somebody could reassure me.)

My initial reaction is somewhat negative, too - having heard so much about the trouble people have had with Anaconda, I’ve actually avoided using it on everything but the shared stacks at NCSA so far. So I’m a bit surprised that the proposal is to go deeper into Anaconda rather than backing away from it. But my avoidance of Anaconda has implied avoidance of detailed discussions about Anaconda, so I’m really not informed on the cost/benefit analysis.

Perhaps more importantly, it’s not clear to me what we’re proposing to replace with Anaconda. If this is just dropping “eups distrib install” for 3rd-party packages in favor of conda packages for these, I could imagine scenarios where that’s a good move - as long as we can still use eups to manage those dependencies after they’re installed (and hence people like me can continue to use eups-declared OS packages or custom installs to satisfy those dependencies instead of using Anaconda).

If we want to move more things into the state our Python and NumPy dependencies are in - in which EUPS doesn’t know about them at all, aside from perhaps confusingly-versioned-and-named dummy packages - then I’m even less in favor; I think I’d much rather move in the opposite direction, and eups-declare products provided by other package managers rather than simply ignore those dependencies in EUPS.

I’m opposed to switching over our development model to require Anaconda to use the LSST DM stack.

I’m very much in favor of using conda to distribute binaries as one useful way of obtaining binary distributions of the software.

In my mind this is almost completely about resolving some of the dependency hell we get ourselves into. As @parejkoj said, this has been on our minds as well. In porting jointcal, we found that one of the packages it uses depends on a package that depends on BLAS. At this point the options are:

  1. eups distrib install BLAS
  2. Assume BLAS is a system dependency

The first option is not attractive because the BLAS build system is complicated and other distribution systems do it better (rpm, yum, etc.). The second is not great because it puts the onus on the user to deal with a potentially difficult install (though most OSs have a reasonable way of installing BLAS).

John noticed that Anaconda comes with BLAS shared libs because of scipy and numpy. This is a side-effect, but if we can depend on the environment, it’s a side-effect we can count on. In this situation there is really no difference between an Anaconda environment and a containerized thing, but in general I prefer an anaconda-like solution because it allows the user to extend the environment if desired.

In any case, the third party dependency issue is going to continue to get worse. In my limited thinking the “right” solution to this is to make all third party dependencies system dependencies, but to make the stack much more modular and only require the system dependencies that are needed by the pieces the user will ultimately need. In this example, we could require that jointcal depends on the system to provide BLAS. The key would be to allow the stack to build and test without BLAS and only complain if someone tried to use jointcal.
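The “only complain when someone tries to use it” idea could be as simple as a deferred runtime check. This is just a sketch of the concept, not anything jointcal actually does; the function name and the list of candidate library names are made up:

```python
import ctypes.util


def require_blas():
    """Hypothetical check called by BLAS-dependent code at point of use.

    The rest of the stack builds and tests fine without BLAS; only a
    component that actually needs it (e.g. jointcal) triggers the error.
    """
    for name in ("blas", "openblas", "mkl_rt"):
        # find_library returns None when the loader cannot see the library
        if ctypes.util.find_library(name) is not None:
            return name
    raise RuntimeError(
        "this component needs a system BLAS; none was found "
        "(install one via conda, rpm, homebrew, ...)")
```

The point of the design is that the check costs nothing at import time and produces a clear, actionable error only for the users who hit the dependency.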

Anaconda has had its issues, but in terms of environment sanitization, it seems like the only game in town. Because of the extensibility advantage, I would vote for Anaconda over containerization.

I second this as the direction we want to go, though I think the limiting factor is not so much the stack’s modularity as our ability to find third-party packages installed various ways on various platforms. And I think we need to move away from hiding “system dependencies” from EUPS; I think things will get much easier if we instead try to use EUPS to declare system dependencies.

To that end, I think I’d be content if we had these two options for installing third-party packages:

  • Easy, but limited: install Anaconda, including some LSST-provided conda packages for third party packages.

  • Do it yourself, using whatever combination of system packages, pip, homebrew, etc. you need. We don’t support everything here, and we may explicitly not support some exceptionally weird configurations.

…with the big caveat that no matter which of these you chose, we have tools that can scrape your system and declare EUPS products (with versions) for whatever is there, and warn you about things it couldn’t find. That’s conceptually something like a big configure script for the latter case (though I’m not suggesting we’d actually want to use autoconf to do it), and something that queries conda for the former.
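As a toy illustration of the conda-querying half, one could imagine turning `conda list --export` output into EUPS declarations. Everything below is a guess at the shape of such a tool, not a working recipe - in particular the `eups declare` command text and the `+conda` version suffix are made-up conventions:

```python
# Sketch only: map `conda list --export` lines ("name=version=build")
# to hypothetical `eups declare` commands so conda-provided third-party
# packages become visible to EUPS.

SAMPLE_EXPORT = """\
# This file may be used to create an environment using conda create --file
numpy=1.10.4=py27_1
scipy=0.17.0=np110py27_2
"""


def conda_to_eups(export_text):
    cmds = []
    for line in export_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, version, _build = line.split("=")
        # "+conda" is an invented suffix marking who provided the package
        cmds.append("eups declare %s %s+conda" % (name, version))
    return cmds


if __name__ == "__main__":
    for cmd in conda_to_eups(SAMPLE_EXPORT):
        print(cmd)
```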

Thank you for the useful discussion.

I am going to get to some of the points later when I have a bit more time, but I just wanted to address @jbosch’s point: the problem with the “do it yourself” solution is that we don’t even know which of the “do it yourself” options work (we can’t test everything in combination with everything), and as long as we imply that some level of DIY environment is expected to work, people quite reasonably come for help.

Aside from the effort required (which, fine, I could ask for more effort) I am also concerned about the transition to operations: user demands for support are likely to go up as the staffing (and construction expertise) goes down. In general in SQuaRE we strive to be in a position to “gift” operations low-maintenance systems with easy upgrade paths, so that if we are put out to pasture we won’t have operations-era people spitting on our virtual graves (trust me, you should see what it looks like from the other side).

I don’t disagree with the comments that anaconda is in itself a source of brittleness. The real question in my mind is whether we are better off putting all our resources in making one thing work. I don’t know the answer.

I completely agree with @KSK that there are fundamental parts of the stack build architecture that make it look like a monolithic dependency chain, which is an issue. It means that when something breaks (see mariadbclient) EVERYTHING breaks, so every issue is an emergency that slows down planned work. I feel like I have had no traction appealing to DM management to tackle this issue, and I am sceptical, based on previous experience, that an RFC could pass, as it would undoubtedly escalate.

I hope to get to some of John’s questions later, as they need Actual Work :slight_smile:

That’s entirely fair, and I’d be quite okay with putting all kinds of scary warnings about lack of support on this sort of approach. I guess I’m just asking that we don’t make it impossible to DIY without changing code or table files, so I can keep my personal extremely-special-snowflake (and Anaconda-free) environment.

Do STScI/Gemini/Astropy have an opinion on Anaconda? @stsci.perry, @etollerud, @jturner or @thomas.robitaille may have community insight on this.

I’m not nearly involved in this area as I used to be. But I will add a few comments.

  1. There is no universally good solution to all the problems that arise in a large system that has to work on multiple platforms with many third-party packages, many of those with their own dependency requirements. As I understand it, this is an NP-complete problem and has long resisted a perfect solution for good reason.

  2. While Anaconda is vendor specific, the machinery is open source, if not as well supported in its use as it could be; but that has been improving, and it is seeing wider and wider adoption. If ContinuumIO went away, I think there is a reasonable chance that there is a sufficiently large community around to help keep support going. But you can pay them for support, and that certainly would help get quick attention. In contrast, Enthought’s distribution did not make the tools for supporting distributions freely available so it really wasn’t acceptable to us for that very reason.

  3. The core python dev community does not make this case a high priority, and last I heard (a couple of years ago) they were pointing to Anaconda as the recommended approach for complex distributions.

  4. STScI is switching to using Anaconda/Conda as the basis of its software distributions (we previously had used a solution we came up with in conjunction with Gemini called Ureka). But we also say that users can always install all the dependencies themselves if the constraints of the distribution we provide don’t meet their needs. But they better be comfortable doing complex installations and debugging what goes wrong.

  5. It seems to me the question that LSST must answer is not whether Anaconda is problem-free, but whether there is any other approach that is better (other than Virtual Machines and similar solutions, which have their own issues from the user perspective). I can’t really speak to that issue for LSST. Regardless of what you do, users will have problems with installations. Ideally, you can tell them what works well, what you don’t support, and whatever. And still, many won’t read those or follow those guidelines. And many more will have things they are completely unaware of screw up their installations (e.g., forgotten things in their start up files) or odd aspects of their particular system. We deal with those kinds of things all the time. I sure wish it were easier to deal with.

I’m asking our person that handles the details of this for comments as well (or corrections :-).

I would tend to think this is a good idea. Admittedly, I don’t know the LSST stack well and even my practical experience with Anaconda is rather limited, but we’re all trying to solve similar problems here. We (STScI & Gemini) created Ureka before Anaconda was first announced because there wasn’t an existing solution that met our needs. Ureka has been successful but we’re in the process of moving towards Anaconda now because, as Frossie says, it’s unnecessarily hard work to maintain the entire thing when there’s now a popular solution in the wider community, backed by more resources. Moreover, Anaconda is clearly more popular amongst the non-IRAF-using developer types and I think there’s something to be said for co-ordinating our stack for astronomy.

It doesn’t seem fair to compare Anaconda with EPD, which was a proprietary distribution that was less extensible. We considered EPD early on and rejected it for those kinds of reasons (we did briefly try using Sage, but that turned out to be quite brittle). It’s probably true that Anaconda is imperfect and somewhat dependent on the goodwill of Continuum, but it’s open source (at least the main distribution) and there isn’t any equally viable alternative besides keeping on rolling our own. In the unlikely event that it becomes non-viable in future, we can still revert to doing that, especially if we have more collaborators… It’s not impossible. Also, Anaconda is managed by people we know (Perry in particular knows), who have behaved in the interest of the community in the past and who are actively pushing it as a science community solution. I am, however, also wondering how supportive they will be of our Anaconda-based effort (eg. they didn’t initially let us know they were working on it at a meeting where we announced Ureka and spent time talking to them).

Regarding things like NumPy breaking the stack, we could potentially put out our own NumPy/SciPy build in that case, to install as part of our “astroconda” environment, which complicates things but would still leave some of the heavy lifting (eg. package management & libraries like Qt) to Continuum. We just don’t have to do it as a matter of routine and we don’t have to do all of the integration testing ourselves. I have been wondering how easy it would be to reproduce Anaconda from scratch but it doesn’t seem hard to rebuild selected packages (I’m just getting up to speed with Joe at STScI).

If you are interested, this is probably a good time to discuss co-ordinating on some common or overlapping Conda-based stack. I can’t speak unilaterally on behalf of Ureka/AstroConda because STScI has been doing most of the work :slight_smile: but I think we would be happy to collaborate more on this. In a parallel development, Matt Craig has started maintaining a Conda channel for AstroPy affiliated packages and a more general OpenAstronomy channel, which we might also co-ordinate with to layer our observatory packages on top of some general astronomy package set (TBC). Again, you can always override bits in your own Conda environment if you need to; it probably won’t do everything your EUPS does but a similar approach has been working well for our users (& even for Gemini operations).

Cheers,

James (at Gemini South).

I didn’t see Perry’s post before replying below but I think I agree anyway.

Yes, as a user, I don’t particularly want to be stuck in a VM, nor do I want, in my support role, to troubleshoot arbitrary combinations of things that users have installed in their OS themselves. A controlled and tested but extensible distribution of native packages with minimal OS dependencies seems as good as you can get. Anaconda is what the community seems to have been settling on, as the first sufficiently general distribution to solve that problem.

Michael: could you elaborate on why you’d prefer we don’t require Anaconda?

Because I have a variety of Python packages and package installation setups across several machines. I want to be able to use custom packages from different sources and existing Python installations while using the LSST stack.

Having a forced Anaconda environment that the DM Stack controls and is the only environment I can use it in would cut off my ability to use tools. I want to have the full array of tools that I’m familiar with available; I don’t want to have to switch around to different virtualenvs or Anaconda installs.

I’m not asking that DM provide comprehensive support for using Homebrew- or fink-based Python installs (my desktop and laptop, respectively), or for all Canopy-based Python installs (our local astro computing cluster).

But I am asking that it not be difficult for me to support myself running the DM stack on those configurations if I’m already familiar with them.

This is the use case outlined in @stsci.perry 's comment #4 above.

  1. STScI is switching to using Anaconda/Conda as the basis of its software distributions (we previously had used a solution we came up with in conjunction with Gemini called Ureka). But we also say that users can always install all the dependencies themselves if the constraints of the distribution we provide don’t meet their needs. But they better be comfortable doing complex installations and debugging what goes wrong.

I realize I’m responding to “only supported environment” + “dogfooding” and interpreting that to mean that this preproposal is to force all of the devs and related interested power users to use this Anaconda set up as well.

Very late on this… (still learning to use community.lsst.org…). But I think my answer depends on what “supported” means. That is, if “supported” means “this is how we suggest you do it, but we’ll try to make it possible to work with other things where we reasonably can”, then I think that might meet @mwv’s needs without imposing undue support burdens.

After all, anaconda isn’t a particularly “special” environment, so if it works in anaconda you should be able to massage it into your own favorite environment.

Very roughly speaking this is how we tend to support distributions in Astropy - much of the dev community uses a few particular environments, but we help with more situations as time/effort allows.

but you can cut corners when you know it exists and you can simply assume things like MKL and where certain libraries will be found.

Sure, that’s true, and I think it’s fine to say “we won’t help you install the MKL, so you’ll have to figure it out”. I just mean it’s not “special” in the sense that if the user wants to spend their time doing that they can, because there’s nothing in Conda that can’t be installed by other means (except the actual conda manager itself).