Sorry, a couple more questions on Qserv to help plan how we ingest additional datasets (e.g. cross-match tables).
i) Can new tables be added to (ingested into) a published database?
ii) During ingest I think the unpublished database and tables are not (readily?) visible in Qserv. Does the ingest procedure directly create the underlying table files and structure as the data are loaded, and then during publication the database is created at each node and these table files are “attached”? Or are the data being loaded via traditional MySQL commands but the tables/database are just not visible? Or indeed some other way? The documentation on qserv-replica-file mentions a proprietary binary protocol, but I’m not sure if this just refers to transferring the data to the nodes.
Let me address your questions here. I will begin with the simple one, regarding the binary protocol employed by qserv-replica-file. Yes, indeed, this tool is meant to transfer data from your local filesystem (where you run the tool) directly to the selected worker node, where the data will be uploaded into the corresponding table (chunked table) within Qserv. The worker node is supposed to be running a special ingest service for processing and uploading the data. As with the other ingest options (via HTTP), you need to know which worker should be used for a given file (table/chunk contribution).
Regarding adding tables to a published catalog: it’s doable (there is a way to do it), however it’s a bit complicated with the current version of the Qserv Ingest system. I have a ticket that’s meant to address this limitation: [DM-28626] Allow un-publishing catalogs in the Qserv Replication/Ingest system (Jira). If there is a chance to ingest the whole catalog (including the missing tables) from scratch, then I would recommend it, as it would be much easier. Otherwise, I could prepare some unofficial recipes for you.
Now to your last question regarding the visibility of databases and tables during ingests. You were right here: the ingest system directly creates the underlying table files and structure as the data are loaded at the corresponding worker nodes. The data go directly to where they will reside; no further data movement is expected (unless Qserv is configured with a replication_level > 1). These tables are not seen (or, to be correct, recognized as ready to serve data) by Qserv. The catalog publishing stage is what makes the new data seen/recognized by Qserv. Here is what’s going on during catalog publishing:
MySQL tables at workers get consolidated (during catalog ingests the tables are created using MySQL partitions in order to implement the “super-transactions” and to increase the overall performance of the ingest; see the sketch after this list). Depending on the size of the catalog this could be a lengthy process.
Table schemas for the newly ingested tables get injected into Qserv’s “czar” database.
Catalog metadata (partitioning parameters of the catalog, etc.) gets injected into Qserv’s “czar” database.
Internal MySQL authorizations are set at all MySQL servers of the Qserv instance for the new catalog.
After these steps, Qserv can see the new catalog.
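To make the consolidation step a bit more concrete, here is a minimal sketch of the kind of MySQL DDL involved; the table name, columns, and especially the per-transaction partitioning key are illustrative assumptions, not the exact statements the ingest system generates:

```sql
-- Illustrative sketch only: a chunked table created during ingest with one
-- MySQL partition per "super-transaction" (table/column names are assumptions).
CREATE TABLE LSST_DR.Object_1234 (
    objectId       BIGINT NOT NULL,
    ra             DOUBLE,
    decl           DOUBLE,
    qserv_trans_id INT NOT NULL        -- hypothetical per-transaction key
)
PARTITION BY LIST (qserv_trans_id) (
    PARTITION p1 VALUES IN (1)
);

-- Each new super-transaction gets its own partition; aborting a transaction
-- amounts to dropping the corresponding partition.
ALTER TABLE LSST_DR.Object_1234 ADD PARTITION (PARTITION p2 VALUES IN (2));
ALTER TABLE LSST_DR.Object_1234 DROP PARTITION p1;

-- At publication time the per-transaction partitions are consolidated into a
-- plain (non-partitioned) table, which is why publishing can take a while.
ALTER TABLE LSST_DR.Object_1234 REMOVE PARTITIONING;
```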
I hope this answers your questions. If you have any further questions, please ask them here or in the Slack channel #dm-qserv.
I’m not sure we’ll need to add tables unless it proves impossible to query across two databases that have partitioned tables, but it’s good to know it might be possible.
Our potential cross-match workflow at the UK-DAC is: we transfer, ingest, and publish an LSST Data Release (LSST_DR). In parallel to the ingest we will extract a skinny set of attributes from the Object table.
This skinny table will be cross-matched with another dataset (say Gaia), and the result will be another skinny catalog/table of the LSST ID, the Gaia ID, and a few attributes.
We want to ingest this table back into qserv, in a new(?) database (LSSTxGaia). We would then want to be able to query the crossmatch table joined with the LSST_DR.Object table to pull out additional information for the matches. Ideally we would also want to be able to join with the full Gaia dataset too. All databases would have the same partitioning schema.
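To be concrete, the sort of table we have in mind would look roughly like this (all names below are placeholders rather than an agreed schema), spatially partitioned with the same parameters as LSST_DR so the chunks line up:

```sql
-- Rough sketch of the skinny cross-match table (placeholder names only).
CREATE TABLE LSSTxGaia.ObjectGaiaMatch (
    objectId       BIGINT NOT NULL,   -- LSST Object ID (the director ID)
    gaia_source_id BIGINT NOT NULL,   -- Gaia source identifier
    ra             DOUBLE,            -- position used for spatial chunking
    decl           DOUBLE,
    sep_arcsec     DOUBLE             -- match separation, for example
);
```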
When we looked at this a few years back it was difficult to do these sorts of joins (via IDs) across databases, but you could put in an inefficient geometric join. I think this was something that was going to be looked at for implementation in future releases of qserv. Does this sort of cross-database join now sound possible? The fallback would be to add the crossmatch table back into the LSST database, where it could be directly linked as it would have a director ID column.
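For reference, the kind of query we would like to be able to run is an ID join across the two databases, roughly like this (again with placeholder table and column names):

```sql
-- Desired cross-database join: pull extra LSST attributes for the matches.
SELECT o.objectId,
       m.gaia_source_id,
       o.psFlux_g              -- placeholder for some additional Object column
FROM LSSTxGaia.ObjectGaiaMatch AS m
JOIN LSST_DR.Object AS o
    ON o.objectId = m.objectId;
```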
On the subject of ingesting data directly into the underlying table file structure: does this mean that, in principle, other file types (e.g. Parquet) can be ingested, as you are not having to go via a MySQL CSV load?