How should we store 100 GB+ data sets? Is git-lfs still the answer?

@jmatt

How should we store external data sets that are on the order of 100 GB?
Is git-lfs still the way forward?

@josh

From the deployment side, do you have concerns or suggestions for having 100 GB data sets available to build, test, and validation slaves?

The particular context motivating this question is expanding the validation data sets. Specifically, @price is kindly helping to make curated (public) data from HSC available for use in testing. To fully test co-addition, and to cover areas large enough to contain enough objects for the variety of tests we plan, we would like to store a reference data set of 138 GB.

This may not be applicable in this context, but https://google.github.io/flatbuffers/ looks interesting from an efficient data storage and access standpoint.

I don’t want to preempt responses from SQuaRE, but my understanding is that git-lfs is indeed the way to go, assuming that the data sets need to be versioned (which they would be if used for build/test/validation).
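
For reference, marking the large files for LFS is just a matter of running git lfs track, which writes a line like the one below into .gitattributes; the FITS pattern is only a guess at what the bulk of such a data set would look like:

    *.fits filter=lfs diff=lfs merge=lfs -text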

I do worry a bit about bandwidth and latency if such data sets are retrieved frequently by automated processes. It may make sense to arrange for them to be pre-staged for automated use, particularly if they have low rates of change.
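
A rough sketch of the pre-staging idea, with entirely hypothetical paths, repository URL, and helper name: an automated job looks for a pre-staged copy of the pinned version first and only falls back to pulling via git-lfs when it is missing.

    import os
    import subprocess

    # Hypothetical layout: pre-staged, read-only copies keyed by data set name and git tag.
    PRESTAGE_ROOT = "/datasets/prestaged"                          # assumed shared disk on the slaves
    REPO_URL = "https://example.org/lsst/validation_data_hsc.git"  # placeholder URL

    def ensure_dataset(name, version):
        """Return a path to the requested data set version, cloning via git-lfs only if needed."""
        staged = os.path.join(PRESTAGE_ROOT, name, version)
        if os.path.isdir(staged):
            return staged  # cheap case: already pre-staged, no network traffic
        # Fall back to a shallow clone at the pinned tag; git-lfs fetches the large objects on checkout.
        subprocess.check_call(["git", "clone", "--branch", version, "--depth", "1", REPO_URL, staged])
        return staged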

@mwv I think a 100 GB data set would be fine on the build slaves as long as we cache it. I wouldn’t be surprised if the initial download takes over an hour since our git-lfs server implementation doesn’t currently support “batching”. Theoretically, it should take less than 15 minutes at a sustained 1 Gbit/s.
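
Back-of-the-envelope arithmetic for that theoretical best case:

    # Ideal transfer time for a 100 GB data set at a sustained 1 Gbit/s, ignoring protocol overhead.
    size_bytes = 100e9          # 100 GB (decimal)
    link_bits_per_s = 1e9       # 1 Gbit/s
    minutes = size_bytes * 8 / link_bits_per_s / 60
    print(minutes)              # ~13.3 minutes, i.e. just under the 15-minute figure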

In the general and imminent case, K-T is right and we are likely to git-lfs it right now.

In the eventual and scalable case (multiple 100+ GB datasets flowing around) I would prefer to deal with them using the remote butler repository functionality against an object store, or something like that. But we don’t need to design this now.
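
For concreteness, a rough sketch of the consumer side using only the current, local-root Butler API; pointing it at an object store is exactly the part that is not designed yet, and the repository path, dataset type, and data id below are purely illustrative:

    from lsst.daf.persistence import Butler

    # Consumer code asks the Butler for datasets by type and data id; the hope is that
    # this stays the same whether the repository root is local disk or, eventually,
    # a remote repository backed by an object store.
    butler = Butler("/datasets/hsc/repo")               # assumed local (or pre-staged) repo root
    calexp = butler.get("calexp", visit=1228, ccd=49)   # illustrative data id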

Git LFS is certainly capable of storing 100 GB objects. We’re using AWS S3 to store our git-lfs objects. The S3 limit for a single object is 5 TB. So we’re unlikely to run into that limit.
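
For the curious, pushing a large object to S3 from Python looks roughly like the sketch below; boto3's transfer manager switches to multipart uploads above a configurable threshold (a single PUT is capped at 5 GB, multipart at 5 TB). The file, bucket, and key names are made up:

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Send large objects in 64 MB parts rather than one monolithic PUT.
    config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                            multipart_chunksize=64 * 1024 * 1024)

    s3 = boto3.client("s3")
    s3.upload_file("coadd_reference.tar",            # hypothetical local file
                   "example-git-lfs-bucket",         # hypothetical bucket
                   "objects/coadd_reference.tar",    # hypothetical key
                   Config=config)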

Yes, that can make sense, particularly if we manage to build local caching into the Butler. I’m still a little worried that versioning via the Butler will not be as fine-grained, trackable, or synchronizable with code releases as versioning in git-lfs.

Another interesting option is RethinkDB (although my focus is more on data querying than data storage, so it might be orthogonal to this discussion).
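
A minimal sketch with the RethinkDB Python driver, assuming a local server and a made-up table of data set metadata (this is really about the querying side, per the caveat above):

    import rethinkdb as r   # classic driver import; recent driver versions use `from rethinkdb import RethinkDB`

    conn = r.connect(host="localhost", port=28015)            # assumed local server
    r.db("test").table_create("dataset_index").run(conn)      # hypothetical metadata table
    r.db("test").table("dataset_index").insert(
        {"name": "validation_data_hsc", "version": "v1", "size_gb": 138}
    ).run(conn)

    # Query side: find every data set larger than 100 GB.
    big = list(r.db("test").table("dataset_index").filter(r.row["size_gb"] > 100).run(conn))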

One more idea: if you’re considering mining the data set with Big Data analysis tools like Apache Spark, you may want to use a format supported by Sqoop (a quick PySpark sketch of the CSV case follows the list below):

ASCII

  • CSV

Binary
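
A quick, hedged illustration of the CSV case with PySpark (Spark 2.x API; the file path is made up):

    from pyspark.sql import SparkSession

    # Start a local Spark session and load a CSV catalogue into a DataFrame.
    spark = SparkSession.builder.appName("hsc-csv-demo").getOrCreate()
    catalog = spark.read.csv("/datasets/hsc/reference_catalog.csv",   # hypothetical path
                             header=True, inferSchema=True)
    catalog.printSchema()
    print(catalog.count())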