Parquet format and Qserv

MRead · May 21, 2020, 8:43am

Hi

Some questions on Parquet-format catalogs and Qserv.

I’ve had a look at

https://confluence.lsstcorp.org/display/DM/Notes+on+Parquet+as+a+released+data+format

and

https://confluence.lsstcorp.org/display/DM/Evaluate+Ingest+Improvement+Requirements+for+Qserv

but was hoping someone could confirm a few things.

i) Is there any plan to ingest/import Parquet files directly into Qserv, or will they always be converted to TSV/CSV first?

ii) Will the pipeline stack generate FITS catalogs and then convert them to Parquet, or write straight to Parquet?

Both are possibly TBD, but I'm just trying to get an idea of how Parquet fits into the data flow. It sounds like Parquet files might be used mainly for storing and distributing the catalogs.

Thanks
Mike

ktl · May 21, 2020, 12:57pm

RFC-662, pointed to from the first Confluence page you cited, is also relevant; it will (eventually) result in a document (via work on ticket DM-24549) that should help answer your questions.

For technical/efficiency reasons, as long as the Qserv workers use MySQL/MariaDB the answer to i) is likely to be TSV, as LOAD DATA INFILE is very fast. For ii), there is no guarantee that FITS persistence of catalogs will continue, and I would not be surprised to see it diminish in importance.
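
For illustration, the generic MySQL/MariaDB bulk load referred to above might be issued roughly as in the sketch below. The table name `Object` and the file path are hypothetical, and the helper `load_data_statement` is made up for this example; it does not describe Qserv's actual ingest tooling, only the plain `LOAD DATA INFILE` statement with its default tab-separated format spelled out explicitly.

```python
def load_data_statement(tsv_path: str, table: str) -> str:
    """Build a plain MySQL/MariaDB bulk-load statement for a TSV file.

    Hypothetical helper for illustration; not part of Qserv.
    """
    # Backticks quote the identifier; a literal backtick is doubled.
    quoted_table = "`" + table.replace("`", "``") + "`"
    # Escape backslash and single quote for the SQL string literal.
    quoted_path = tsv_path.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"LOAD DATA INFILE '{quoted_path}' "
        f"INTO TABLE {quoted_table} "
        # These clauses restate the server defaults: tab-separated fields,
        # backslash escapes, newline-terminated rows.
        "FIELDS TERMINATED BY '\\t' ESCAPED BY '\\\\' "
        "LINES TERMINATED BY '\\n'"
    )


# Hypothetical path and table name.
print(load_data_statement("/tmp/object_chunk.tsv", "Object"))
```

The statement would then be passed to whatever database client is in use. The file must be readable by the database server itself, unless the `LOCAL` variant is used.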

MRead · May 22, 2020, 3:34pm

Thanks. We're looking again at the more recent version of Qserv and at loading data into it, and had wondered if we should be testing with Parquet. But if the actual loading will likely always be via TSV, then there doesn't seem to be much point in looking at the Parquet side of things at this stage.

If data releases are distributed in Parquet, converting to TSV will presumably be a relatively simple step in the data flow at that stage.

Thanks again
Mike
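
As a rough idea of what the Parquet-to-TSV step discussed in this thread could look like, here is a minimal sketch, assuming the third-party `pyarrow` package is installed. The function names and the batch size are made up for illustration, and nothing here reflects an official LSST or Qserv conversion tool. The formatting follows MySQL's default `LOAD DATA` conventions: tab-separated fields, `\N` for NULL, and backslash escapes.

```python
def format_field(value) -> str:
    """Render one value in MySQL's default LOAD DATA text format."""
    if value is None:
        return "\\N"  # MySQL reads \N as NULL
    if isinstance(value, bool):
        return "1" if value else "0"
    text = str(value)
    # Escape the backslash first, then the field and line separators.
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def format_row(values) -> str:
    """Join formatted fields with tabs and terminate the row with a newline."""
    return "\t".join(format_field(v) for v in values) + "\n"


def parquet_to_tsv(parquet_path: str, tsv_path: str, batch_size: int = 10_000) -> None:
    """Stream a Parquet file to TSV batch by batch (hypothetical helper)."""
    import pyarrow.parquet as pq  # third-party; imported only when needed

    parquet_file = pq.ParquetFile(parquet_path)
    columns = parquet_file.schema_arrow.names
    with open(tsv_path, "w", encoding="utf-8", newline="") as out:
        for batch in parquet_file.iter_batches(batch_size=batch_size):
            data = batch.to_pydict()  # column name -> list of Python values
            for row in zip(*(data[name] for name in columns)):
                out.write(format_row(row))


print(repr(format_row([42, None, "a\tb", True])))
```

Reading in batches keeps memory use bounded for large catalogs. Values such as NaN or nested columns would need extra handling that this sketch does not attempt.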