Are ObjectIDs ordered across tracts in some predictable way?
Or, more precisely, are the ObjectIDs from a given tract drawn from a contiguous block such that no blocks overlap across tracts?
I would like to use Dask to analyze the 166 tracts from DESC DC2. The original Parquet files are naturally partitioned by tract, but I would like to use objectID as my DataFrame index. Performance would be much better if I could avoid a shuffle when I set the index. Can I assume that objectIDs are assigned in some ordered way across tracts?
I don’t think I need any particular relationship between the actual value of the tract number and the ObjectID. I just need each tract to have a block of ObjectIDs such that no ObjectIDs from other tracts fall within the range of that block. If I can guarantee that, then setting the index shouldn’t be painful. I may have to work a little harder to convince Dask that this is the case, but I think it’s quite possible.
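To make the condition concrete, here is a minimal sketch of the check I have in mind. It assumes I have already computed the minimum and maximum objectID for each tract. The function name `tract_divisions` and the tract numbers and ID values in the example are made up for illustration and are not real DC2 values.

```python
from typing import Dict, List, Tuple


def tract_divisions(
    id_ranges: Dict[int, Tuple[int, int]]
) -> Tuple[List[int], List[int]]:
    """Check that per-tract ID ranges are disjoint and build divisions.

    id_ranges maps tract number -> (min objectID, max objectID).
    Returns the tracts ordered by their minimum ID, plus a divisions
    list of length len(tracts) + 1: the minimum ID of each tract
    followed by the maximum ID of the last tract.
    Raises ValueError if any two ranges overlap.
    """
    if not id_ranges:
        raise ValueError("no tracts given")

    # Order tracts by the start of their ID block, not by tract number.
    ordered = sorted(id_ranges.items(), key=lambda item: item[1][0])

    # Adjacent blocks must not touch or overlap.
    for (tract_a, (_, hi_a)), (tract_b, (lo_b, _)) in zip(ordered, ordered[1:]):
        if lo_b <= hi_a:
            raise ValueError(
                f"ID ranges of tracts {tract_a} and {tract_b} overlap"
            )

    tracts = [tract for tract, _ in ordered]
    divisions = [lo for _, (lo, _) in ordered] + [ordered[-1][1][1]]
    return tracts, divisions


# Hypothetical example values, not taken from the actual catalog.
example_ranges = {3830: (200, 299), 3829: (100, 199), 3831: (300, 450)}
tracts, divisions = tract_divisions(example_ranges)
```

If the check passes and the partitions are loaded in the returned tract order, I believe something like `ddf.set_index("objectID", sorted=True, divisions=divisions)` should let Dask skip the shuffle. That call is untested against the real files, and I am assuming each partition corresponds to exactly one tract.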