EUPS distrib "tarball" binary packages

We’re announcing a preview of our stack binary releases for the OSX and Linux (CentOS 6 & 7) platforms.

Try it out

Users may try out a binary installation with the w_2017_25 weekly release.

First, ensure the platform pre-reqs are installed.

Then get and run the revised newinstall.sh script:

curl -sSL https://raw.githubusercontent.com/lsst/lsst/master/scripts/newinstall.sh | bash -s -- -cbt

The -t flag enables “tarball” installations.
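
For readers who would rather inspect the script before executing it, the download and run stages can be kept separate. What follows is only a minimal sketch, not part of the official instructions: `run_newinstall` is a hypothetical helper name, and it assumes the script has already been saved to a local file.

```shell
# Hypothetical helper: run an already-downloaded copy of newinstall.sh.
# Download step (not executed here):
#   curl -sSL -o newinstall.sh https://raw.githubusercontent.com/lsst/lsst/master/scripts/newinstall.sh
run_newinstall() {
    script=$1
    shift
    # Refuse to run if the download failed or produced an empty file.
    if [ ! -s "$script" ]; then
        echo "missing or empty script: $script" >&2
        return 1
    fi
    # Same flags as the one-line form, passed as ordinary arguments.
    bash "$script" "$@"
}
```

After reviewing the downloaded file, `run_newinstall ./newinstall.sh -cbt` should behave like the piped one-liner above.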

Then load the environment and install the latest weekly as usual:

. ./loadLSST.bash
eups distrib install lsst_distrib -t w_2017_25 -vvv

# required to fix python shebang lines
curl -sSL https://raw.githubusercontent.com/lsst/shebangtron/master/shebangtron | python

# optional -- run the stack demo
curl -sSL https://raw.githubusercontent.com/lsst-sqre/buildbot-scripts/master/runManifestDemo.sh | bash -s -- --small --tag w_2017_25

eups distrib install preferentially installs binaries, and falls back to source installations of packages.

What is a “tarball” binary package?

Many DM developers are already familiar with the EUPS distrib source, or eupspkg, packages. An EUPS distrib binary, or tarball, package is very similar in functionality to eupspkg, except that it contains the verbatim contents of an installed EUPS product rather than source code which requires an extra build step.

EUPS distrib has support for searching through multiple EUPS_PKGROOTs. It’s possible to transparently mix the usage of tarball and eupspkg from different repos.

What binary builds are available?

In general, tarballs for the lsst_distrib “meta” product and all of its dependencies are created as part of the automated DM weekly tag/release process. EUPS distrib does not appear to have the ability to list which tags are available in a repo (tarball or eupspkg). It is, however, possible to browse an HTTP-accessible repo to discover which tags are present, e.g., https://eups.lsst.codes/stack/redhat/el7/gcc-system/miniconda3-4.2.12-7c8e67/ where the available tags are the files that end with .list.
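
As a rough illustration of browsing a repo for tags, the sketch below pulls the names of `.list` files out of a directory index read on stdin. `list_tags` is a hypothetical helper, and it assumes the server's index page contains the file names literally (for example in link text), which may not hold for every web server.

```shell
# Hypothetical helper: extract tag names from a directory index on stdin.
# Assumes tag files appear in the listing as "<tag>.list".
list_tags() {
    grep -o '[A-Za-z0-9_.-]*\.list' | sed 's/\.list$//' | sort -u
}
```

It could be fed with something like `curl -sSL <repo URL> | list_tags`, where the repo URL is the one shown above.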

target platforms

Binary object code is inherently platform specific, which means a separate binary build has to be produced for at least each operating system. In some cases a single binary will work [as intended] on multiple releases of an operating system, depending on lib/ABI changes. This is currently the case for our macOS binaries, which are known to work on 10.11 (“El Capitan”) and 10.12 (“Sierra”), but not for the CentOS/EL binaries. Each major.minor release of python linked against also requires a separate set of binaries. Similarly, the compiler used may also require separate binary releases, because combining object code generated by different compiler releases can cause problems that would prevent an end-user from building against a binary distribution.

operating system + compiler / python major.minor version matrix

At present, binaries are being built with a single compiler per operating system version. However, we anticipate that compiler versions may change in the future and that the end-user should be protected from installing EUPS binary products built with different compiler versions. Thus, the compiler is treated as a separate axis in the matrix of binary builds.

|            | el6+devtoolset-3 | el7+gcc-system | osx+clang-800.0.42.1 |
|------------|------------------|----------------|----------------------|
| py2.7%     | X                |  X             | ^                    |
| py3.5%     |                  |  X             | ^                    |
  • % In reality, these are specific miniconda distribution versions
  • ^ not currently being produced – coming soon

That’s it?

Actually, no. Since various DM software packages link against libraries that are installed from conda packages, the specifics of the python/conda package env the build was run in need to be tracked as well. This is currently represented as a 6-character abbreviated sha1 hash of the lsst/lsstsw repo commit that provides the specific conda environment files used to bootstrap the build environment.
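
To make the abbreviation concrete, here is a small sketch that shortens a full commit sha1 to its first 6 characters. `abbrev_ref` is a hypothetical helper name; inside a git checkout, `git rev-parse --short=6 <commit>` gives a comparable result, although git may lengthen the abbreviation to keep it unambiguous.

```shell
# Hypothetical helper: abbreviate a full commit sha1 to its first 6 characters.
abbrev_ref() {
    printf '%s\n' "$1" | cut -c1-6
}
```

For the lsst/lsstsw commit `7c8e670ce392ea11c64b4c326a130d6fa7f2d489`, this yields `7c8e67`, the suffix that appears in the repo path.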

My head just exploded…

TL;DR – newinstall.sh will attempt to figure out all the parameters needed to use binary tarballs when you are on a supported platform. The hope is that most binary consumers will use newinstall.sh and not need to worry about the details. In fact, newinstall.sh is the only supported vehicle for consumption of the binary tarballs at this time.

A convention for a hierarchy of EUPS distrib repos has been worked out in the hope of enabling an EUPS_PKGROOT to be constructed heuristically.

All repos need to live under a base URL. At present, the base URL for published EUPS distrib products is https://eups.lsst.codes/stack . A source or eupspkg repo may live at <base url>/src. At present, that would be https://eups.lsst.codes/stack/src .

There are [currently] ~4 parameters required to construct the path under the base URL for a binary release:

| parameter                               | example                   |
|-----------------------------------------|---------------------------|
| `<os family>/<platform>`                | `redhat/el7`              |
| `<compiler>`                            | `gcc-system`              |
| `<miniconda py[23]>-<miniconda version>` | `miniconda3-4.2.12`      |
| `<lsstsw abbrev-ref>`                   | [7c8e67](https://github.com/lsst/lsstsw/tree/7c8e670ce392ea11c64b4c326a130d6fa7f2d489/etc) |

Putting those values together would construct an EUPS_PKGROOT URL of https://eups.lsst.codes/stack/redhat/el7/gcc-system/miniconda3-4.2.12-7c8e67/

We can also have EUPS transparently fall back to using eupspkg if a tarball build of a specific product/version isn’t available, e.g.,

https://eups.lsst.codes/stack/redhat/el7/gcc-system/miniconda3-4.2.12-7c8e67/|https://eups.lsst.codes/stack/src
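
The construction described above can be sketched as a small shell function. This is only an illustration under the convention stated in this post; `build_pkgroot` and its argument order are hypothetical and are not taken from newinstall.sh.

```shell
# Hypothetical helper: assemble an EUPS_PKGROOT value with the binary
# (tarball) repo first and the source (eupspkg) repo as a fallback,
# separated by "|".
build_pkgroot() {
    base=$1        # e.g. https://eups.lsst.codes/stack
    platform=$2    # <os family>/<platform>, e.g. redhat/el7
    compiler=$3    # e.g. gcc-system
    miniconda=$4   # e.g. miniconda3-4.2.12
    ref=$5         # abbreviated lsstsw ref, e.g. 7c8e67
    printf '%s/%s/%s/%s-%s/|%s/src\n' \
        "$base" "$platform" "$compiler" "$miniconda" "$ref" "$base"
}
```

Calling `build_pkgroot https://eups.lsst.codes/stack redhat/el7 gcc-system miniconda3-4.2.12 7c8e67` reproduces the value shown above, which could then be exported as EUPS_PKGROOT.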

How to list EUPS_PKGROOT components

$ eups distrib path
https://eups.lsst.codes/stack/redhat/el7/gcc-system/miniconda2-4.2.12-7c8e67
https://eups.lsst.codes/stack/src
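
When eups itself is not yet available, the same breakdown can be approximated by splitting the variable on the `|` separator. This is a sketch only, with a hypothetical helper name, and it does not reproduce any normalization that `eups distrib path` may perform.

```shell
# Hypothetical helper: print each "|"-separated component of an
# EUPS_PKGROOT value on its own line.
pkgroot_components() {
    printf '%s\n' "$1" | tr '|' '\n'
}
```

For example, `pkgroot_components "$EUPS_PKGROOT"` prints the binary repo URL followed by the source repo URL when both are configured.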

Can’t you make this less complicated?

We tried hard to hide more of the complexity but the nature of DM’s custom build tool-chain proved to have fatal edge cases when packaging up a complete miniconda/python environment as an EUPS distrib product.

How much of this do users need to know? I’d expect them to only need to set EUPS_PKGROOT and then run the install command. They might need to upgrade eups first, but why a full newinstall?

The installation should be able to go into a pre-existing stack.

If a user does not want to use newinstall.sh, they need to know how to form the EUPS_PKGROOT URL. I believe that just enough information is provided to do that. However, unless the compiler, miniconda release, and conda package env match the build env created by newinstall.sh, we know there are likely to be problems with linkages (e.g., numpy).

Do you think we should add instructions on how to recreate the same environment that newinstall.sh creates?

The instructions might be clearer if the download and run stages were separate rather than showing what some people might see as magic. Are you saying that newinstall.sh does not prompt for tarball mode like it prompts for other options?

Which sounds like you have to tell people how to set up their system to match these builds if they intend to use these binaries as a basis for doing their own development builds.

So if people upgrade to python 3.6, everything breaks? Or are you saying that if the conda ABIs don’t change for minor updates, people can update in place?

For the moment, how many combinations are you actually supporting? You presumably don’t have many different combinations for each weekly. Also, I assume this means that if you upgrade to python 3.6 all the subsequent weeklies will disappear for people and they’ll have to reinstall from scratch (or else try to work out how to change their package root) or suddenly start doing source installs unexpectedly.

This does seem to be significantly more flaky than conda installs (not conda builds of course, they were horrendous).

Yes. It is currently implemented as an option flag or an env var. There is not a yes/no prompt at this point, as we’d like some positive feedback from our more advanced users first.

[C]Python does not have a [complete] stable ABI. In addition, miniconda could change which libraries are bundled and linked against in the distribution. See: https://docs.python.org/3/c-api/stable.html

5

No. newinstall.sh has been refactored so that you can re-run it on top of an existing install (which will update the EUPS_PKGROOT). I am considering adding a utility or shell function to automate the update.

That is also my impression. The lsst conda packages were more isolated from the rest of the system due to more extensive usage of public conda packages (in place of some prereq. packages) and dependencies between lsst/public conda packages. In addition, conda handled fixing up the shebang lines, a step which will have to be done manually by the end user on OSX.

or we patch eups to allow a post install hook.

Why does newinstall need a -t option? Is that used to set the EUPS_PKGROOT to point at binaries?

I think that it would indeed be cleaner to have a script that sets EUPS_PKGROOT for the user (as you propose).

Yes. Initially, this was the default behavior of newinstall.sh, but this broke the workflow of existing users who were using a different compiler and needed to build against the distrib install. So I very hastily added the -t flag and made source-only installation the default. See https://github.com/lsst/lsst/pull/44

We could look into providing a utility along those lines. newinstall.sh has the benefit of knowing the target miniconda version and conda module set to use when selecting the correct repo. Probing the python version in the user’s env is trivial but reverse mapping the conda package set would require some thought.

Can’t you look at CONDA_DIR? The workflow would be:

  • get into the environment you want to use LSST in
  • run script to set EUPS_PKGROOT (or to write an lsst.table file to set it and declare it)
  • run eups distrib install ...

And going with the second approach wouldn’t have broken existing users as they wouldn’t have changed their EUPS_PKGROOT.

AFAIK, CONDA_DIR is not a standard env var; some scripts declare it for their own usage. However, finding out what is in the conda env isn’t difficult as long as the conda executable is in the path. What would be necessary is comparing what is in the current conda env to the conda env export files in lsst/lsstsw and mapping that back to the abbreviated sha1 used in the repo name. It would be simpler for the user to declare the desired conda package set, which is essentially how newinstall.sh currently works.

You’re right; mea culpa. But you can get it from the python path, can’t you?

I think we want to support, “Here is my world. Please add/update LSST to/in it”

Also, the way that e.g. compiler versions and (I suppose) py2/3 issues are usually handled is by setting the eups “flavor”. I think that that’s probably the way to handle these issues too, possibly by adding a “topping” to the flavor rather than concatenating values (i.e. flavor==“DarwinX86” and topping=“stdlibc_py3” rather than flavor=“DarwinX86_stdlibc_py3”). That would allow people to have multiple shared stacks installed with e.g. py2 and py3 in a clean way.

I thought I would share my experience installing the stack, in case it would be helpful for other users. But let me first say that having the possibility of installing binary releases is really great. On my build platform, building the stack (i.e. lsst_distrib) from source takes no less than 3 hours; installing binaries takes about 10 minutes. So, thanks for this really appreciated improvement.

At CC-IN2P3 we have been deploying all the available weekly versions of the stack since January this year. Those deployments are intended for end-users of the stack, as opposed to developers of the stack. We deploy in a read-only area of a shared file system which is visible from all the nodes in both the login farm and the batch farm.

Each weekly version of lsst_distrib is integrally installed from source (and now binary, when available) in its own directory via newinstall.sh. Each version includes the Python interpreter, which comes with the version of miniconda required by each specific weekly, and all the required extra packages. This makes each deployed version self-consistent and almost* self-contained.

Since we are preparing the transition to use exclusively Python 3, we are currently deploying 2 versions for each weekly: one with Python 2 and one with Python 3. We will stop doing this once everybody is ready to use Python 3 only.

Over the last 6 months, this approach has proven useful for end-users: they can easily test their software against any of the available versions of the stack just by changing a directory path, they can very easily check for regressions in the stack, and they can prepare for the definitive transition to Python 3.

When a weekly version becomes obsolete or nobody uses it, its directory can simply be removed. The price to pay is to consume more disk space, but in my opinion the benefits of this approach outweigh that cost.

We are considering making the weeklies also available in the cloud via CernVM FS: currently only the stable versions are publicly available there. This requires that we finish the ongoing revamp of the distribution infrastructure.


*: I say “almost” because there are some software components such as git or the C/C++ compiler which are not part of a given weekly version.