Distributing sample data for documentation examples/tutorials

Getting datasets to users is one of the usability issues I’m grappling with for our Pipelines documentation projects, particularly tutorials.

So far we’ve used Git LFS-backed repositories like ci_hsc to provide datasets. While this is fine for CI, we’ve never committed to using Git LFS for user-facing products. Given the nuances in configuring Git LFS for LSST, I’m not sure this is a path I want to take.

We could instead upload individual datasets as tarballs to a static web server (S3, or similar). The URL itself is sufficient for identifying and versioning the dataset. Whenever we need to update an example dataset we upload a new one and change the corresponding URL in the documentation.

I think Astropy does something like this, and even has tools for managing example datasets.

Another idea is to use a DAX webserv in a read-only mode to supply datasets for documentation. Is this something we can offer? I don’t think that all datasets could be served from DAX/Butler (for example, a tutorial might start with a collection of raw FITS files in order to show how images are ingested into a local repository), but a DAX could solve most data needs.

I’m curious what ideas you all might have for serving example datasets, and whether a “webserv for documentation” is something we can stand up.

Could we put together a simple web server that generates versioned tarballs from git-lfs repos on-the-fly via server-side git archive calls? Users don’t usually care about getting the git metadata with their test dataset, but they will care about getting the version that matches their version of the pipeline.
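To sketch what I mean: the server could shell out to `git archive`, which produces a tarball of a given ref and subdirectory with no Git metadata. The repository, ref, and path names below are placeholders; note also the caveat in the comment about LFS-backed files.

```python
import subprocess  # the service would use this to run the command

def archive_command(ref, subdirectory, output_path):
    """Build a `git archive` invocation that writes a metadata-free tarball."""
    # Caveat: for Git LFS-backed files, `git archive` exports the small pointer
    # files, not the real contents; those would need to be resolved separately.
    return [
        "git", "archive",
        "--format=tar.gz",
        "-o", output_path,  # where the tarball is written
        ref,                # branch, tag, or commit SHA
        subdirectory,       # only this path is included in the archive
    ]

# The service would run this inside a clone of the repository, e.g.:
# subprocess.run(archive_command("w.2017.32", "raw", "/tmp/ci_hsc-w.2017.32.tar.gz"),
#                cwd="/repos/ci_hsc", check=True)
```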

That’s a really good idea. The URL would contain both the Git reference and the file path (a lot like GitHub’s blob paths, https://github.com/lsst/ci_hsc/blob/720662c9d0b8171aa6c1208be46e8115453ae10e/SConstruct).

Maybe we can even download tarballs of directories if the download URL points to a directory and we add a .tar.gz extension?

I think one thing I don’t want to do, in general, is force people to download a whole Git LFS repository. Many tutorials will need just a few sub-directories in an example data Git LFS repo.

Normal tar files (.tar.gz) of sample test data under 50 MB per file would work within GitHub's normal file size constraint. For performance, half of that limit (~25 MB, or a couple of SDSS FITS images) may be more desirable for tutorial/documentation use cases.

@kennylo, you’re suggesting we commit files straight to GitHub and have the user download the blob straight from GitHub?

(otherwise, I don’t think we want users to have to clone tutorial data repositories. All those versions of data, and unrelated datasets from other tutorials, would make for a heavy git clone).

That’s a viable option, provided we can live within the GitHub constraints, as mentioned. As I understand it, files in Git repositories are stored efficiently: large files are broken down into chunks, and only the deltas are versioned. Arguably it will cost us more storage to do this on our own, unless we end up using Git or something similar ourselves.

This is my contribution to this discussion based on my experience preparing the datasets for the hackathon we had last Friday for http://lyon2017.lsst.fr.

For this hackathon we used the ci_hsc data mentioned by @jsick. The way we proceeded was to git clone that repository, remove the git-specific subdirectories, build a .tar.gz file, and upload it to CC-IN2P3’s OpenStack Swift instance, making it world readable. From there, we could just use curl to download and deploy the dataset locally to each of the VMs we created for our event.

The instructions to deploy the dataset were straightforward:

$ cd /datasets
$ curl -OL "https://ccswift.in2p3.fr:8080/v1/AUTH_e81854c1742348c282f66df844388aa4/hackathon/ci_hsc.tar.gz"
$ tar -zxf ./ci_hsc.tar.gz

As suggested by @jbosch, a better naming convention would be welcome to reflect the version of the dataset.
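One simple convention, sketched below, would be to embed the Git ref (or release tag) in the tarball name so that a download can be matched to a pipeline version; the `w.2017.32`-style tag here is only an example.

```python
def tarball_name(dataset, ref):
    """Compose a versioned tarball filename, e.g. ci_hsc-w.2017.32.tar.gz."""
    # Git refs can contain "/" (e.g. tickets/DM-1234); make them filename-safe.
    safe_ref = ref.replace("/", "-")
    return f"{dataset}-{safe_ref}.tar.gz"

print(tarball_name("ci_hsc", "w.2017.32"))  # ci_hsc-w.2017.32.tar.gz
```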

Uploading the .tar.gz file to Swift and giving it the access permissions needed for it to be publicly downloadable requires better tooling than the CLI provided for Swift. At IN2P3 we have been exploring mechanisms for end users to easily share files of significant size, such as datasets. To upload the data and set access permissions we would like more convenient tools than we currently have (to understand what this means, please see here). I plan to work on improving the current situation.

In my opinion, having datasets in the cloud, easily downloadable by end users using ubiquitous tools (such as curl, wget, etc.), is really valuable. We used Swift in our case but Amazon S3 would be equally convenient and essentially transparent from the end user’s perspective.

One more vote for something like tarballs accessible via curl & wget.

In sims, we did experiment with user-facing git-lfs for some of our repos which contain significant amounts of data. I believe we would characterize the result as “not ready for primetime” … although with the conda installation (once the conda install was configured to set up git-lfs), it was actually not too bad (it was terrible before the conda installation auto-configured this).

That’s interesting. Going forward, our recommended installation method for end users will be EUPS binaries; I wonder if we could set up Git LFS for users by default. The other half of it is installing/modifying the ~/.gitconfig and ~/.git-credential files. That’s also something we’d want to make as easy as possible.

On the other hand, if we want to combine the Git LFS versioned file workflow we already use as developers with the ease of simple URL-based downloads for users, here’s a design we could implement. I’m not saying we will implement this. We’d have to decide if it’s worth the expense of developing and running a new service.

HTTP proxy service for Git LFS repos


The service is a simple HTTP server that runs on one of SQuaRE’s existing Kubernetes clusters. We’ve found this deployment pattern minimizes the operational burden of running services, making things like load balancing and rollouts quite easy.

The service could be behind a content distribution network like Fastly. This would be useful for workshops, for example, where many people are downloading the same files at around the same time from the same geographic location.

URL format

The service responds to GET requests to URLs formatted like this:

GET https://datasets.lsst.io/<org>/<repo>/<gitref>/<filepath>

  • <org> and <repo> map to a GitHub repository: https://github.com/<org>/<repo>.
    (We would whitelist the GitHub organizations that are proxied through the service.)

  • <gitref> is a Git ref, such as a branch, tag, or commit SHA.

  • <filepath> is the path of any file or directory in the Git repository.

Proxying files

When responding to a GET request, the service uses GitHub’s API to download the contents of the file identified by the URL.

  • If it’s a regular Git-backed file, the service would pass the file directly to the original requestor. For Git-backed files, the service is a simple proxy to GitHub’s API.

  • If it’s a Git LFS-backed file, the service will read the hash from the pointer file stored in Git at that file position. That hash is used to compute a URL in LSST’s Git LFS S3 bucket. The service would pass those file contents through to the original requestor. This functionality is what’s unique to the service.
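For reference, a Git LFS pointer file is a small text file with `version`, `oid`, and `size` lines. A sketch of how the service might parse one and derive an object URL; the bucket URL here is a made-up placeholder, not LSST’s actual LFS storage layout, and the OID is an arbitrary example.

```python
# Hypothetical base URL; the real LFS bucket layout may differ.
LFS_STORE = "https://s3.amazonaws.com/example-lsst-git-lfs"

def parse_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

def object_url(pointer_text):
    """Compute a download URL from the sha256 OID in the pointer."""
    oid = parse_pointer(pointer_text)["oid"].split(":", 1)[1]
    return f"{LFS_STORE}/{oid}"

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
"""
```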

Directory listings

If the URL points to a directory in the Git repo, rather than a file, the service will return a list of all file URLs contained in that directory.

This allows us to use a tool like wget to download the full contents of a directory.
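As a sketch of what that listing might look like (the service root, repository, and file paths here are illustrative, not real endpoints):

```python
BASE = "https://datasets.lsst.io"  # hypothetical service root

def listing(org, repo, gitref, directory, files):
    """Build the list of file URLs returned for a directory request."""
    prefix = f"{BASE}/{org}/{repo}/{gitref}"
    # Keep only files under the requested directory.
    wanted = directory.rstrip("/") + "/"
    return [f"{prefix}/{path}" for path in files if path.startswith(wanted)]

urls = listing("lsst", "ci_hsc", "master", "raw",
               ["raw/a.fits", "raw/b.fits", "SConstruct"])
```

A user could then pipe such a listing to `wget -i -` to fetch everything in the directory.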

Compressed directories

Instead of downloading individual files, we might want to download an entire directory as a .tar.gz file. I don’t think the service can make a tarball live during a user request, so instead the service could prepare tarballs ahead of time.

In a Git repository’s directory, we could add a .datasets.lsst.io configuration file to indicate that the directory should be downloadable as a tarfile.

The service receives webhook events for commits to GitHub repositories. When a commit is made to a repository with .datasets.lsst.io configuration files, the service launches a background queue to make a tarball of the designated directories. Those tarballs would be stored in the service’s own S3 bucket, and served to users who make a request like:

GET https://datasets.lsst.io/<org>/<repo>/<gitref>/<directory>.tar.gz
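The webhook handling could be fairly small. One possible simplification, sketched here under the assumption of a GitHub push-event payload (with `added`/`modified` file lists per commit): only queue tarball jobs for directories whose marker file appeared in the push.

```python
import posixpath

CONFIG_NAME = ".datasets.lsst.io"

def directories_to_archive(push_event):
    """Find directories whose config file was added or modified in a push."""
    dirs = set()
    for commit in push_event.get("commits", []):
        for path in commit.get("added", []) + commit.get("modified", []):
            if posixpath.basename(path) == CONFIG_NAME:
                dirs.add(posixpath.dirname(path))
    return sorted(dirs)  # each entry becomes one background tarball job

event = {"commits": [{"added": ["data/raw/.datasets.lsst.io"],
                      "modified": ["README.md"]}]}
```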


I don’t think any of this service’s functionality is particularly difficult to develop. Compressed directories would be more difficult to implement than the pass-through file proxying, but we don’t need compressed directories as an MVP.

[Sorry I’m just getting around to reading this.]

A potentially significant optimization: can you redirect the request to GitHub or S3 rather than passing the file contents through the service?
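Concretely, instead of streaming bytes, the service could answer with an HTTP 302 whose Location header points at GitHub or the S3 object. A framework-free sketch (the target URL is a made-up example):

```python
def redirect(location):
    """Build a minimal 302 response as (status, headers, body)."""
    return 302, {"Location": location}, b""

status, headers, body = redirect(
    "https://s3.amazonaws.com/example-lsst-git-lfs/some-oid")
```

The client’s own HTTP stack then fetches the file directly, so large payloads never transit the service.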

Yes, that’s a great idea.

I also recently re-worked our Git LFS configuration documentation (https://developer.lsst.io/tools/git_lfs.html) and found out that anonymous configuration is a lot easier than expected. I think I can initially roll out tutorials with Git LFS data packages and work on an HTTP download later.