Hi Mike
This is regarding your questions on partitioning the input data for Qserv. Your first question in this context was:
Originally the director_key was set to “gaia_source_ID” but I tried changing it to objectId in case the underscores affected things. The director_key is not the first column in the tsv files, does it need to be?
Underscores wouldn’t be a problem as long as the column name doesn’t start with qserv_ and isn’t chunkId or subChunkId. Qserv may inject special columns that start with the prefix qserv_ to store various metadata within the data tables. At the moment we have only one such column, qserv_trans_id. More columns could be added in the future. The other two columns I’ve mentioned above (chunkId and subChunkId) also have special meanings in Qserv.
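To summarize with a quick illustration (the first two names are from your schema, the rest are the reserved ones mentioned above):

gaia_source_ID   -- fine: underscores by themselves are not a problem
objectId         -- fine as well
qserv_trans_id   -- reserved: may be injected by Qserv
chunkId          -- reserved: has a special meaning in Qserv
subChunkId       -- reserved: has a special meaning in Qserv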
Now, let me move to another topic: the meaning of the parameter part.id in a config file for the sph-partition tool. Please note that you may also specify this parameter as a command-line option when running the tool:
sph-partition --part.id=<col-name> ...
This is the name of a field that carries an object identifier. If it’s provided then the secondary index will be opened or created.
To explain why or when one might need to use this option, let me dive a bit into the design of Qserv. As you already know, Qserv implements horizontal (which is different from MySQL’s vertical) partitioning of tables by spreading table data between worker nodes. Each node has a subset of the full table’s data. The data are stored in the “chunked” tables. Those tables are named after the name of the main (logical) table, such as your Xmatch, and the corresponding value of the chunkId column. For example, qserv-worker-0 of your setup would have the following tables after you finish the ingest of chunk 179619:
testWP311a.Xmatch_179619
testWP311a.XmatchFullOverlap_179619
...
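This chunked layout is what allows Qserv to split a user query into many per-chunk queries that run in parallel on the workers. Schematically (this is only an illustration of the idea, not the literal query rewriting that Qserv performs), a query you submit against the logical table is executed against each of its chunked tables:

SELECT * FROM testWP311a.Xmatch WHERE ...;
-- is effectively run on qserv-worker-0 against the chunked table:
SELECT * FROM testWP311a.Xmatch_179619 WHERE ...;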
Tables that are partitioned in this way are known as partitioned tables. Note that Qserv also supports fully replicated tables that are known as regular tables. A full copy of a given regular table is supposed to be ingested into each worker node of Qserv. I’m not going to discuss further details on why one would want to ingest a table as a regular one; that would be another interesting topic that you may eventually run into down the road.

Back to the partitioned tables. Tables of this kind form a logical tree in which there is at least one so-called director table and zero or more so-called dependent ones. Qserv knows (from the configuration files that you’re presenting to the Ingest system) what flavor of table you have in each case, and uses this information when processing queries that involve either (or both) of these tables.
In any case, data of both kinds of tables have to be partitioned in order to be ingested into Qserv. The ingest-time attribute is_director in a table’s config file is meant to differentiate between the two.
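For illustration, the distinction could look roughly like this in the table config files presented to the Ingest system. This is only a schematic fragment: is_director and director_key are the attributes discussed here, while the other fields and the overall layout are assumptions on my part, so please check the Ingest system documentation for the exact format:

{"table": "<director-table>",  "is_partitioned": 1, "is_director": 1, "director_key": "objectId", ...}
{"table": "<dependent-table>", "is_partitioned": 1, "is_director": 0, "director_key": "objectId", ...}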
What is important to know here is that rows of the dependent tables are required to be associated with the corresponding rows in some director table. This association is similar to the FK -> PK association in a traditional RDBMS. Qserv requires that rows linked by this association in both the director and the dependent tables be found within the same chunk. This introduction brings us to the essence of your question: when partitioning the input data of the director and the dependent tables, we need to ensure that the corresponding rows are placed within the correct chunks, so that Qserv can successfully execute queries like this:
SELECT * FROM director AS dir, dependent AS dep WHERE dir.prop = '123' AND dir.objectId = dep.objectId;
The director tables are always partitioned on the values of their RA/DEC columns. Things are a bit more complicated for the dependent tables. There are two options here:
- If the dependent table has the RA/DEC columns, and the values of these columns for a given row exactly match the values of the RA/DEC columns of the corresponding row of the director table, then you may partition the dependent table based on RA/DEC. In this case, you don’t need to specify the option --part.id. Note that you’re still required to ensure your data are correct in terms of referential integrity across rows of the related tables, so that for each row’s objectId in the dependent table there is a matching row with the same objectId in the director table.
- The second scenario is seen when a dependent table doesn’t have spatial columns. In this case, you would have to help the partitioning tool sph-partition figure out at which chunk to place each row of the dependent table. This requires following object identifiers from the dependent tables to their director table during the partitioning stage. The option --part.id is used by the tool to create a simple file-based index at chunk_object_index.txt. Each line of this index is a triple of 3 columns: (objectId, chunkId, subChunkId). Note that this index is partial, as it’s based on what’s found in your input files. If you have many input files (CSV/TSV) then you would have to consolidate all the index files you get after finishing partitioning the data of the director table into a single file. You will need this file when partitioning the input data of the corresponding dependent tables. You would have to pass the location of that index file to the tool via the option --part.id-url=<path>. Note that you would also need to specify the name of the objectId column of the dependent table via --part.id=<col-name>. This combination of parameters tells the tool that you’re partitioning a dependent table, so that it can pull the chunk number for each row of the dependent table from the index file. See the sketch after this list for the whole workflow.
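Putting it all together, the workflow might look like this. Treat it as a rough sketch: apart from --part.id and --part.id-url, all paths, file names, and the way the input files are passed to the tool are placeholders that depend on your actual configuration:

# 1) Partition the director table. With --part.id specified, each run
#    leaves a partial chunk_object_index.txt next to its output chunk files.
sph-partition --part.id=objectId ... director_input_1.tsv
sph-partition --part.id=objectId ... director_input_2.tsv

# 2) Consolidate the partial indexes into a single file. Each line of the
#    index carries an (objectId, chunkId, subChunkId) triple.
cat out_1/chunk_object_index.txt out_2/chunk_object_index.txt > chunk_object_index.txt

# 3) Partition the dependent table, pulling the chunk number for each row
#    from the consolidated index:
sph-partition --part.id=objectId --part.id-url=chunk_object_index.txt ... dependent_input.tsv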
I hope this helps, though this is a complicated topic. Please contact me should you have any further questions on this subject. I may also need to extend the documentation on the Ingest system to cover this topic and to provide examples of how to partition data of the related (director and dependent) tables.
Regards,
Igor