Creating Qserv ingestion files

We are preparing to conduct pipeline runs and ingest the results into Qserv. For the purposes of this question, this should be similar to a standard run producing measurement and forced catalogues from coadded images. Currently, we use the Butler to retrieve catalogues and photometric calibrations. We then use the lsst.afw.table.BaseCatalog.asAstropy function to get the full table, which we modify in various ways before writing it to CSV for ingestion into MySQL. You can see a rough test script here.
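
For concreteness, here is a very rough sketch of the kind of workflow we mean; the repo path, dataset names, and dataId below are placeholders rather than our actual setup:

# Rough sketch only: a Gen2-style Butler is assumed; paths and dataId are placeholders.
import lsst.daf.persistence as dafPersist

butler = dafPersist.Butler("/path/to/rerun")
dataId = dict(tract=9813, patch="4,4", filter="HSC-I")

meas = butler.get("deepCoadd_meas", dataId)                          # measurement SourceCatalog
photoCalib = butler.get("deepCoadd_calexp", dataId).getPhotoCalib()  # photometric calibration

tbl = meas.asAstropy()                                               # full table as an astropy Table
# ... apply the photometric calibration and other column edits here ...

# Any non-scalar columns need flattening or dropping before a CSV write will succeed.
scalar_cols = [name for name in tbl.colnames if tbl[name].ndim == 1]
tbl[scalar_cols].write("deepCoadd_meas-9813-4,4-HSC-I.csv", format="ascii.csv", overwrite=True)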

Firstly, is this basic procedure fundamentally wrong? That is, am I missing some already available means of turning Butler products into Qserv- and/or MySQL-ingestible CSV, TSV, or Parquet files?

Secondly, if the answer to the previous question is no, what key features should the resultant CSV tables have? We are making various decisions regarding 32- vs 64-bit floats, maximum column-name lengths, whether to duplicate repeated columns for each band, whether to add new columns to facilitate useful joins, and whether to create additional metadata tables to aid queries, and so on. If the asAstropy function is not optimised for creating final ingestible files, what are the key things we should modify?
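
To illustrate the sort of edits we are weighing (purely illustrative; the 64-character limit is MySQL's identifier limit, and the band prefix and objectId column are examples, not a settled schema):

import numpy as np
from astropy.table import Table

tbl = Table.read("deepCoadd_meas-9813-4,4-HSC-I.csv")      # hypothetical input from the step above

# Per-band columns: one option is to prefix every repeated column with the band.
band = "i"
tbl.rename_columns(tbl.colnames, [f"{band}_{name}" for name in tbl.colnames])

for name in tbl.colnames:
    # 32- vs 64-bit floats: downcasting halves the table size at some cost in precision.
    if tbl[name].dtype == np.float64:
        tbl[name] = tbl[name].astype(np.float32)
    # Maximum column-name length: MySQL caps identifiers at 64 characters.
    if len(name) > 64:
        print(f"column name too long for MySQL: {name}")

# An extra column to facilitate joins against the main tables (hypothetical id scheme).
tbl["objectId"] = np.arange(len(tbl), dtype=np.int64)

tbl.write("object_ingest-9813-4,4-i.csv", format="ascii.csv", overwrite=True)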

Finally, we are unsure how Qserv distributes its blocks, whether by id or by sky position. We want to make sure our new tables will be sensibly distributed under the same system to optimise queries. These new tables will contain the public LSST objects in addition to new detections that we are merging in with the mergeCoaddDetections task. Should this be an important consideration when making the tables for ingestion? We are concerned that Qserv might distribute the LSST-detected objects into different blocks from the same objects in the main LSST tables, such that joins on those ids are not efficient.

We have progressed since the previous questions we asked on this topic here and here, and would appreciate any information about future plans or recent developments on this front. We are making and ingesting CSVs successfully, but want to make sure the final tables are as close to the main public Qserv ones as possible.

Many thanks,

Raphael.

I don’t know about LSST database-loading scripts in general or Qserv in particular, but from doing large loads into MySQL databases in the past, I think you should avoid loading via CSVs if you can at all help it. Loading via a CSV is slow because parsing the CSV is a much slower operation than writing into the database. If you’ve already got the data in memory, load it into the database directly from there.
You should probably also avoid the asAstropy method, as it’s known to be slow too.
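
For example (purely illustrative; the connection details, table, and columns are placeholders), with mysql-connector-python you can insert rows straight from Python:

import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="ingest",
                              password="secret", database="testdb")
cur = cnx.cursor()

# Build the rows directly from the in-memory catalogue instead of
# round-tripping through a CSV file.
rows = [(1, 150.1234, 2.3456), (2, 150.1300, 2.3500)]
cur.executemany(
    "INSERT INTO object_test (objectId, ra, decl) VALUES (%s, %s, %s)",
    rows,
)
cnx.commit()
cur.close()
cnx.close()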

Recent versions of the stack do include pipelines that generate parquet files whose schemas are close to being standardized on our DPDD. We do not plan to ingest afw tables directly.

I’ve done some Qserv catalog ingest starting with the stack’s parquet outputs, such as the objectTable_tract dataset from butler repos; https://dmtn-170.lsst.io/ is my previous write-up of the process, including which tools and workarounds were used. The tooling is evolving quickly, but the overall concepts in that DMTN still pretty much reflect my current understanding. You might want to take a look as a reference.
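
For example, something along these lines pulls the per-tract table out of a (Gen2) butler repo and writes it out for ingest; the repo path and dataId are placeholders, and depending on the stack version butler.get may hand back a pandas DataFrame directly rather than a ParquetTable:

import lsst.daf.persistence as dafPersist

butler = dafPersist.Butler("/path/to/rerun")
parq = butler.get("objectTable_tract", tract=9813)
df = parq.toDataFrame()                      # ParquetTable -> pandas DataFrame

df.to_csv("objectTable_tract-9813.csv")      # the DataFrame index is written as the first column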

That is interesting, thank you @price. At the moment we aren’t too worried about speed for testing, but that will become very important for the all-sky runs. We will try the parquet option.

Thanks too to @hsinfang for the link to the document which is very useful.

I see that the writeObjectTable.py command line task is set up to merge bands into a parquet file.

writeObjectTable.py $REPO --rerun $RERUN --id \
    filter=$FILTERS \
    tract=$TRACT patch=$PATCH

I will use that and then look into copying the HSC proof-of-concept example, which makes use of parquet-tools.
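
In the meantime, pyarrow gives a quick equivalent of the parquet-tools inspection step; this is only a sketch, the file path below is a guess at the rerun layout, and the column flattening is just one option:

import pandas as pd
import pyarrow.parquet as pq

path = "/path/to/rerun/objectTable-9813-4,4.parquet"      # placeholder path
pf = pq.ParquetFile(path)
print(pf.schema)                                          # inspect the merged schema

df = pf.read().to_pandas()
# If the merged table carries multi-level column names (dataset/filter/column),
# flatten them into single MySQL-friendly names before writing out.
if isinstance(df.columns, pd.MultiIndex):
    df.columns = ["_".join(str(level) for level in col) for col in df.columns]
df.to_csv("objectTable-9813-4,4.csv", index=False)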

And the newer tooling is here: https://github.com/lsst-dm/qserv-ingest (tools for loading DC2 data inside Qserv).
