How should we store 100+ GB data sets? Is git-lfs still the answer?


How should we store external data sets that are on the order of 100 GB?
Is git-lfs still the way forward?


From the deployment side, do you have concerns or suggestions about making 100 GB data sets available to the build, test, and validation slaves?

The particular context motivating this question is expanding the validation data sets. In particular, @price is kindly helping to make available curated (public) data from HSC for use in testing. To fully test co-addition, and to cover areas large enough to contain enough objects for the variety of tests we plan, we would like to store a reference data set of 138 GB.

This may not be applicable in this context, but looks interesting from an efficient data storage and access standpoint.

I don’t want to preempt responses from SQuaRE, but my understanding is that git-lfs is indeed the way to go, assuming that the data sets need to be versioned (which they would be if used for build/test/validation).

I do worry a bit about bandwidth and latency if such data sets are retrieved frequently by automated processes. It may make sense to arrange for them to be pre-staged for automated use, particularly if they have low rates of change.
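One way a pre-staged copy could be validated before an automated job decides to skip the download is to compare it against the Git LFS pointer file, which records the object's SHA-256 digest and size. This is only a minimal sketch: the function names `parse_lfs_pointer` and `is_prestaged` are hypothetical, and it assumes the standard pointer layout (`version`, `oid sha256:...`, `size ...`), not any particular cache layout on the build machines.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) from Git LFS pointer text."""
    # Each pointer line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])


def is_prestaged(pointer_text: str, cached_file: Path) -> bool:
    """True if cached_file matches the pointer's size and SHA-256 digest."""
    digest, size = parse_lfs_pointer(pointer_text)
    # Cheap check first: a size mismatch avoids hashing a large file.
    if not cached_file.is_file() or cached_file.stat().st_size != size:
        return False
    hasher = hashlib.sha256()
    with cached_file.open("rb") as f:
        # Hash in 1 MiB chunks so large files are not read into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == digest
```

A job could call `is_prestaged` for each pointer in the checkout and fetch only the objects that fail the check, which keeps repeated automated runs off the network when the data set rarely changes.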

@mwv I think a 100 GB data set would be fine on the build slaves as long as we cache it. I wouldn’t be surprised if the initial download takes over an hour, since our git-lfs server implementation doesn’t currently support “batching”. Theoretically, it should take less than 15 minutes at a sustained 1 Gbit/s.
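The theoretical figure is simple arithmetic. A minimal sketch, assuming decimal units (1 GB = 10^9 bytes, 1 Gbit/s = 10^9 bits/s) and no protocol overhead; the function name `transfer_minutes` is just illustrative:

```python
def transfer_minutes(size_gb: float, rate_gbit_per_s: float) -> float:
    """Idealized transfer time in minutes, ignoring protocol overhead."""
    size_gbit = size_gb * 8  # 8 bits per byte
    return size_gbit / rate_gbit_per_s / 60  # seconds -> minutes


print(f"100 GB: {transfer_minutes(100, 1.0):.1f} min")  # 100 GB: 13.3 min
print(f"138 GB: {transfer_minutes(138, 1.0):.1f} min")  # 138 GB: 18.4 min
```

Under the same ideal conditions the 138 GB HSC data set would need a little over 18 minutes; without batching, the real download is expected to be considerably slower.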

In the general and imminent case, K-T is right and we are likely to use git-lfs for it right now.

In the eventual and scalable case (multiple 100+ GB datasets flowing around) I would prefer to deal with them using the remote butler repository functionality against an object store, or something like that. But we don’t need to design this now.

Git LFS is certainly capable of storing 100 GB objects. We’re using AWS S3 to store our git-lfs objects, and the S3 limit for a single object is 5 TB, so we’re unlikely to run into that limit.

Yes, that can make sense, particularly if we manage to build local caching into the Butler. I’m still a little worried that versioning via the Butler will not be as fine-grained, trackable, or synchronizable with code releases as versioning in git-lfs.

Another interesting option is RethinkDB (although my focus is more on data querying than data storage, so it might be orthogonal to this discussion).

More info…

One more idea: if you’re considering mining the data set with Big Data analysis tools like Apache Spark, you may want to use a format supported by Sqoop:


  • CSV