That's interesting. Going forward, our recommended installation method for end users will be EUPS binaries; I wonder if we could set up Git LFS for users by default. The other half of it is installing or modifying the ~/.git-credentials file. That's also something that we'd want to make as easy as possible.
On the other hand, if we want to combine the Git LFS versioned file workflow we already use as developers with the ease of simple URL-based downloads for users, here's a design we could implement. I'm not saying we will implement this. We'd have to decide if it's worth the expense of developing and running a new service.
HTTP proxy service for Git LFS repos
The service is a simple HTTP server that runs on one of SQuaRE's existing Kubernetes clusters. We've found this deployment pattern minimizes the operational burden of running services, making things like load balancing and rollouts quite easy.
The service could be behind a content distribution network like Fastly. This would be useful for workshops, for example, where many people are downloading the same files at around the same time from the same geographic location.
The service responds to GET requests to URLs formatted like this:

    /<repo>/<gitref>/<filepath>

- `<repo>` maps to a GitHub repository. (We would whitelist the GitHub organizations that are proxied through the service.)
- `<gitref>` is a Git ref, such as a branch, tag, or commit SHA.
- `<filepath>` is the path of any file or directory in the Git repository.
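As a sketch, the URL-to-repository mapping might look like this in Python (the whitelisted org names, and the assumption that refs contain no slashes, are illustrative, not decided):

```python
from typing import NamedTuple


class ProxyRequest(NamedTuple):
    """GitHub coordinates extracted from a proxy URL path."""

    repo: str      # "org/name" GitHub repository
    gitref: str    # branch, tag, or commit SHA
    filepath: str  # path to a file or directory in the repo


# Hypothetical whitelist of proxied GitHub organizations.
ALLOWED_ORGS = {"lsst", "lsst-sqre"}


def parse_path(path: str) -> ProxyRequest:
    """Split a path like /<org>/<repo>/<gitref>/<filepath> into its parts.

    For simplicity this assumes the ref itself contains no slashes;
    a real implementation would need to disambiguate branch names
    like "tickets/DM-1234" against the repository's ref list.
    """
    org, repo, gitref, filepath = path.strip("/").split("/", 3)
    if org not in ALLOWED_ORGS:
        raise PermissionError(f"organization {org!r} is not whitelisted")
    return ProxyRequest(f"{org}/{repo}", gitref, filepath)
```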
When responding to a GET request, the service uses GitHub's API to download the contents of the file identified by the URL.

If it's a regular Git-backed file, the service would pass the file directly to the original requestor. For Git-backed files, the service is a simple proxy to GitHub's API.

If it's a Git LFS-backed file, the service would read the hash from the placeholder file stored in Git at that file's position. That hash is used to compute a URL in LSST's Git LFS S3 bucket. The service would pass those file contents through to the original requestor. This functionality is what's unique to the service.
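The placeholder ("pointer") files that Git LFS stores in Git have a simple, line-oriented format, so extracting the hash is easy. A minimal sketch, assuming the LFS bucket keys objects by their bare sha256 hash (the bucket layout here is an assumption, not how our store is necessarily organized):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer (placeholder) file into its fields.

    Pointer files are tiny "key value" line-oriented files, e.g.:

        version https://git-lfs.github.com/spec/v1
        oid sha256:4d7a...
        size 12345
    """
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec"):
        raise ValueError("not a Git LFS pointer file")
    return fields


def lfs_object_url(pointer_text: str, bucket_url: str) -> str:
    """Compute a URL for the object's bytes in the LFS S3 bucket.

    Assumes objects are stored under their bare sha256 hash.
    """
    oid = parse_lfs_pointer(pointer_text)["oid"]
    return f"{bucket_url}/{oid.removeprefix('sha256:')}"
```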
If the URL points to a directory in the Git repo, rather than a file, the service will return a list of all file URLs contained in that directory. This allows us to use a tool like `wget` to download the full contents of a directory.
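GitHub's contents API already returns a JSON array for directory paths, so building that listing could be a small transformation (the `base_url` proxy prefix is a hypothetical name):

```python
def directory_listing(contents: list[dict], base_url: str, gitref: str) -> list[str]:
    """Build proxy URLs from a GitHub contents-API directory response.

    `contents` is the JSON array GitHub returns for
    GET /repos/<owner>/<repo>/contents/<path>?ref=<gitref>; each entry
    carries "type" ("file" or "dir") and "path" keys. `base_url` is the
    hypothetical proxy prefix for the repository.
    """
    return [
        f"{base_url}/{gitref}/{entry['path']}"
        for entry in contents
        if entry["type"] == "file"
    ]
```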
Instead of downloading individual files, we might want to download an entire directory as a `.tar.gz` file. I don't think the service can build a tarball on the fly during a user request, so instead the service could prepare tarballs ahead of time.
In a Git repository's directory, we could add a `.datasets.lsst.io` configuration file to indicate that the directory should be downloadable as a tarfile.
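The configuration format is undecided; it could be as simple as an empty marker file, or a small YAML file with room for future options. This schema is purely illustrative:

```yaml
# .datasets.lsst.io -- illustrative schema only; nothing here is decided.
# Placed in a directory to mark it as downloadable as a tarball.
tarball: true
```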
The service receives webhook events for commits to GitHub repositories, and when a commit is made to a repository with `.datasets.lsst.io` configuration files, the service launches a background queue job to make a tarball of each designated directory. Those tarballs would be stored in the service's own S3 bucket and served to users who make a GET request for them.
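The core of that background job could be a few lines with the standard library's `tarfile` module. In the real service the input would be a fresh checkout of the commit named in the webhook payload, and the result would be uploaded to S3 rather than left on local disk:

```python
import tarfile
from pathlib import Path


def make_tarball(directory: Path, output_dir: Path) -> Path:
    """Package a marked directory as a .tar.gz, as the background job might.

    The tarball is named after the directory and its members are rooted
    at the directory name, so it unpacks into a single folder.
    """
    tarball = output_dir / f"{directory.name}.tar.gz"
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)
    return tarball
```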
I don't think any of this service's functionality is particularly difficult to develop. Compressed directories would be more difficult to implement than the pass-through file proxying, but we don't need compressed directories for an MVP.