Adding pandas to sims

One problem with using pandas in sims is that we haven’t really worked out what is required to add it to our requirements. Pandas is provided by anaconda, so most of our users will have it (since the official LSST python is anaconda - right?), but it is not part of the jenkins build system (which uses miniconda and thus only adds packages which are approved … although this is also kind of fuzzy). Other third party python packages which are not LSST supported/distributed, notably astropy and scipy, are currently usable by sims, so it’s been hard to understand where the line is drawn.

The third party packages currently supported are listed at
https://confluence.lsstcorp.org/display/DM/DM+Third+Party+Software
There are three levels of ‘supported third party packages’: (a) official, distributed packages (like pyephem, healpy, sqlalchemy, pymssql), (b) officially used but not distributed packages (ds9 and ws4py; intended to be things we’re using for development but may change in the future?) and (c) third party packages for developer use and not distributed (like scipy).

I think all of the third party packages sims currently uses are listed there, except astropy. Note that we use scipy in an integral way in the sims stack, although it’s only supposed to be for developer use in the third party package lists above.
Note that distributed third party packages are supposed to have an eups-packaged repo in github/lsst (see https://github.com/lsst/sqlalchemy for the sqlalchemy package; this seems to be pretty straightforward in most cases). Scipy and astropy do not currently have third party eups packages.

There is a Confluence page describing the process of adding third party packages to the stack:
https://confluence.lsstcorp.org/display/LDMDG/Adding+a+new+package+to+the+build
This suggests that first you file an RFC (see https://confluence.lsstcorp.org/display/LDMDG/Discussion+and+Decision+Making+Process) which basically means filing a JIRA ticket, at which point you’re saying you want a particular package and are willing to do the work to make it happen, and to maintain the package for LSST.
If the RFC passes (doesn’t receive any objections), then you create a third party package following the instructions here:
https://confluence.lsstcorp.org/display/LDMDG/Distributing+third-party+packages+with+EUPS
and then add it to the lsstsw/etc/repos.yaml file. It looks like we could actually do this fairly simply for pandas, if we assume that scipy stays in its “used by developers but not released, so users have to provide this themselves” box (which is not really how we’re actually using it, but perhaps would be good enough? Note that we already require it in sims packages).

So it seems like we could package up astropy and pandas like this, file an RFC, and add them to the distributed third party packages. I think to do it easily we’d have to assume scipy is user-provided (at least, this is my impression from talking to Simon). It does make me wonder why we’re taking the effort to do this for these packages (and for sqlalchemy) given that sqlalchemy, astropy and pandas come standard with anaconda. However, to support users not using anaconda, this is necessary. [do we have any sims users who are not on anaconda?]

My point here is to try to document what we’d have to do. If I’ve missed anything, please add it in the comments. Also, please feel free to comment on what other paths forward we might follow. Sims is actually a bit different from DM, but since sims packages are built with Jenkins (and we want them to be built with Jenkins), we have to make sure to support that use too.

@KSK @connolly @danielsf

I have a vague memory from when the Sims team gathered in Tucson in March 2015 that Josh Hoblitt said we wanted to have a totally independent Jenkins system for Sims.

Am I making that up?

Is this still a plan?

If so: is this a reason to try to push a little harder for that (so we can put whatever we like on our Jenkins without mucking up the DM build)?

Some relevant links. RFC-50 is the big, unresolved, discussion on requirements.txt vs EUPS packages. That RFC should probably be escalated to @ktl.

Also the PR that I approved but which has not been deployed because there were issues with Qserv depending explicitly on the EUPS anaconda package (@josh would updating the anaconda package fix the problem in the short term?) :

I’m pretty sure it’s not official. newinstall.sh even asks you which python you want to use. We allow any python to be used so long as it’s the right version and has numpy in it. Some people like to use their system python and anaconda is convenient for others. This is why one option is simply to state which conda or pip packages are pre-requisites and not involve them in the EUPS side at all.

It is simple unless pandas depends on other python packages. Those packages also have to have EUPS packages created for them.
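
One rough way to see how many extra EUPS packages that would mean is to ask an installed copy of the package what it declares as dependencies. This is only a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`) and that the package is already installed in the current environment; `requirement_names` and `hard_dependencies` are hypothetical helper names, not part of any LSST tool.

```python
import re
from importlib import metadata


def requirement_names(requirement_strings):
    """Return bare package names from requirement strings, skipping optional extras."""
    names = []
    for req in requirement_strings or []:
        # Entries guarded by an "extra == ..." marker are optional, not hard dependencies.
        if ";" in req and "extra" in req.split(";", 1)[1]:
            continue
        # The name is everything before the first version specifier, bracket or space.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req.strip())
        if match:
            name = match.group(0).lower()
            # The same package can appear several times with different markers.
            if name not in names:
                names.append(name)
    return names


def hard_dependencies(package):
    """List the declared hard dependencies of an installed package.

    Returns an empty list if the package is not installed (assumption: the
    caller treats "unknown" and "none" the same way).
    """
    try:
        return requirement_names(metadata.requires(package))
    except metadata.PackageNotFoundError:
        return []
```

Calling `hard_dependencies("pandas")` would then give the names that would each need their own EUPS package under this approach. It only reports what the package metadata declares directly, so transitive dependencies would need the same check repeated on each name.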

Interesting. RFC-50 looks like it’s pretty dead (last comment in August 2015?). It does sound like not such a bad idea, but it also doesn’t sound like it would get resolved all that quickly.

Astropy and pandas are pip-installable, but there are other python packages which are pip-installable which we re-package and install that way. (this is what RFC-50 is all about, in the end, right?). So we can decide to say to users, ‘pip install pandas and astropy’ – and figure out what to do about jenkins, or we can eups package them.

Pandas requires scipy … there are already complications about scipy and we should probably just make it a pre-req for the stack (like we do for numpy and matplotlib). If that’s out of the way, packaging pandas might be ok.
Astropy depends on a lot more python packages and thus would be harder to package. (actually - wow, I didn’t realize there were so many!). I actually wouldn’t want to do this.

Maybe. If we do, we should think about how we want to support people not on anaconda. Saying pip install these X prerequisites may be enough.
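
If we went the "just list the prerequisites" route, something like the following could tell a user which ones they still need. It is a sketch only: the function name `missing_prerequisites` and the list of modules are hypothetical, and it checks top-level module names rather than package versions.

```python
import importlib.util


def missing_prerequisites(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical prerequisite list; the real one would be whatever sims agrees on.
    missing = missing_prerequisites(["numpy", "scipy", "pandas", "astropy"])
    if missing:
        print("Missing prerequisites, try: pip install " + " ".join(missing))
```

This works the same way whether the user's python is anaconda, miniconda or a system python, which is the point of keeping these packages out of the EUPS side.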