@boutigny has recently been processing HSC data, repeatedly, using the LSST stack on the batch farm of CC-IN2P3. Both the raw HSC images and the resulting reduced images are stored in our GPFS cluster.
During those processing campaigns we observed low CPU efficiency (i.e. the ratio of CPU time to wall time) for most of @boutigny's jobs, which we suspect could be a consequence of GPFS client cache thrashing. @boutigny, L. Tortay (GPFS expert) and I decided to investigate further. Our aim is to understand what I/O operations are induced by some of the stack's command line tasks. What follows is a report of what we have observed so far.
Our simplest possible test consists of just two steps:

- ingest a single raw image (one CCD) using `ingestImages.py`
- process that single CCD image using `processCcd.py`

For this test we are using `w_2017_29` running on CentOS 7.
Image ingestion
The command for this step is:
$ ingestImages.py data ./raw/HSCA01151351.fits --mode link
As a result of this, among other things, a file registry.sqlite3
is created in the data
directory. Both the data
directory and the raw
directory reside on GPFS. In the process of populating this file, the sqlite3 library creates some temporary files (with extensions -journal
or -wal
) in the same directory as the final destination file, i.e. data
. It repeatedly acquires and releases locks on those temporary files. When the population of the registry file is finished, the temporary file is renamed to its final name registry.sqlite3
.
Locking files, or portions of files, in a shared file system is a potentially costly operation. According to the SQLite documentation, there is a way to instruct the library where to create temporary files, but it didn't work in my tests: setting the environment variables `SQLITE_TMPDIR`, `TMPDIR` or `TMP` had no effect when using a test program linked against the sqlite3 shared library shipped with the stack.
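A possible explanation, worth double-checking: the SQLite documentation states that rollback journals and WAL files are always created in the same directory as the database file, and that `SQLITE_TMPDIR` only controls where *anonymous* temporary files (temporary tables and indices, vacuum files, etc.) are placed. A minimal sketch reproducing the observation (the paths here are hypothetical):

```python
import os
import sqlite3
import tempfile

# Point SQLite's temporary directory at local scratch; per the SQLite
# docs this only affects anonymous temporary files, not journals.
os.environ["SQLITE_TMPDIR"] = tempfile.gettempdir()

os.makedirs("data", exist_ok=True)
conn = sqlite3.connect("data/registry.sqlite3")
with conn:
    # Any write transaction briefly creates data/registry.sqlite3-journal,
    # next to the database file, regardless of SQLITE_TMPDIR.
    conn.execute("CREATE TABLE IF NOT EXISTS raw (visit INTEGER, ccd INTEGER)")
    conn.execute("INSERT INTO raw VALUES (38080, 23)")
conn.close()
```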
I would like to suggest that the command line tasks of the stack create any temporary files in the local storage of the compute node (either local scratch disk or even a RAM disk) and copy those files back to their final destination (i.e. the `data` directory on GPFS in this particular case) when they are no longer needed. The command line tasks could look at the value of the POSIX `TMPDIR` variable for the location in which to store those temporary files. Processing sites would then set the `TMPDIR` variable to the appropriate scratch location for each job.
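For illustration only, here is a minimal sketch of the pattern I have in mind; the helper name and usage are hypothetical, not an existing stack API:

```python
import os
import shutil
import tempfile

def publish_via_local_scratch(final_path, populate):
    """Build a file in node-local scratch, then move it to the shared FS.

    tempfile honours TMPDIR by default, so a batch job can redirect the
    scratch location simply by exporting TMPDIR.
    """
    scratch_dir = tempfile.mkdtemp()
    local_path = os.path.join(scratch_dir, os.path.basename(final_path))
    try:
        populate(local_path)                 # all writes and locks stay local
        shutil.move(local_path, final_path)  # one copy to GPFS at the end
    finally:
        shutil.rmtree(scratch_dir, ignore_errors=True)

# Hypothetical usage: build the registry locally, then publish it once.
# publish_via_local_scratch("data/registry.sqlite3", populate_registry)
```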
Process CCD
The command for this test is:
$ processCcd.py input --output output --id visit=38080 ccd=23
There is nothing special (to our knowledge) about this particular visit or this particular CCD. The size of the input image file for CCD 23 is 17 MB.
At the end of the process several files are created under the `output` directory:
```
$ tree output
output
├── 01318
│   └── HSC-I
│       ├── corr
│       │   ├── BKGD-0038080-023.fits
│       │   └── CORR-0038080-023.fits
│       ├── output
│       │   ├── ICSRC-0038080-023.fits
│       │   ├── SRC-0038080-023.fits
│       │   ├── SRCMATCH-0038080-023.fits
│       │   └── SRCMATCHFULL-0038080-023.fits
│       ├── processCcd_metadata
│       │   └── 0038080-023.boost
│       └── thumbs
│           ├── flattened-0038080-023.png
│           └── oss-0038080-023.png
├── config
│   ├── packages.pickle
│   └── processCcd.py
├── repositoryCfg.yaml
└── schema
    ├── icSrc.fits
    └── src.fits

8 directories, 14 files
```
As in the previous step, we collected the I/O activity using `strace` and then analysed its output. The table below summarises the activity related to some of the files generated in this step. The SIZE column is the final size of the generated file, the READ column is the amount of data read via the `read` system call while populating the file and, analogously, the WRITE column is the amount of data written via the `write` system call. All numbers are in MB:
| FILE | SIZE (MB) | READ (MB) | WRITE (MB) |
|------|----------:|----------:|-----------:|
| output/01318/HSC-I/output/ICSRC-0038080-023.fits | 1 | 265 | 3 |
| output/01318/HSC-I/output/SRC-0038080-023.fits | 12 | 2299 | 24 |
| output/01318/HSC-I/output/SRCMATCH-0038080-023.fits | 0 | 0 | 0 |
| output/01318/HSC-I/output/SRCMATCHFULL-0038080-023.fits | 0 | 47 | 1 |
| output/01318/HSC-I/corr/BKGD-0038080-023.fits | 0 | 1 | 0 |
| output/01318/HSC-I/corr/CORR-0038080-023.fits | 98 | 13 | 98 |
| output/schema/icSrc.fits | 0 | 15 | 0 |
| output/schema/src.fits | | | |
Notice, for instance, that generating the file `SRC-0038080-023.fits`, whose final size is 12 MB (12876480 bytes), required reading 2299 MB, that is, 191 times its final size. Similarly, writing the 1 MB file `ICSRC-0038080-023.fits` required reading 265 MB from it, or 265 times its size.
This looks really suspicious and is probably unintended. Looking in detail at what happens at the file system level, we can see a pattern:
- write some FITS key-value pairs in the first HDU header (11520 bytes)
- set the file position to 0
- read all the contents of the file written so far
- write some data to the file (typically a FITS HDU, that is, 2880 bytes)
- set the file position to 0
- read all the contents of the file written so far
- write some data to the file (typically a FITS HDU, that is, 2880 bytes)
- set the file position to 0
- read all the contents of the file written so far
- and so on…
It is not clear to me why it is necessary to re-read the whole file before each write operation. But if this is the intended behaviour, could it also be done in a scratch area local to the compute node, copying the result to its final destination when appropriate? Given the sizes of the generated files, I don't think we would be limited by the local storage capacity of the compute nodes.
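To make the cost of this pattern concrete, here is a toy model in plain Python (illustration only, not stack or cfitsio code) of appending 2880-byte blocks with a full re-read before each write. The volume read grows with the square of the number of appends, while the file itself grows only linearly:

```python
import os

BLOCK = 2880  # FITS block size in bytes

def append_with_full_reread(path, n_blocks):
    """Mimic the observed pattern: re-read the whole file before each append."""
    total_read = 0
    with open(path, "wb+") as f:
        for _ in range(n_blocks):
            f.seek(0)
            total_read += len(f.read())   # re-read everything written so far
            f.write(b"\0" * BLOCK)        # then append one 2880-byte block
    return total_read

read_bytes = append_with_full_reread("/tmp/toy.fits", 1024)
size = os.path.getsize("/tmp/toy.fits")
print(f"final size: {size / 1e6:.1f} MB, total read: {read_bytes / 1e6:.1f} MB")
# Total read is BLOCK * n*(n-1)/2 bytes, i.e. roughly (n/2) times the
# final file size, so the amplification factor grows with every append.
```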
(If you want to see in detail the I/O activity on those two files, as reported by `strace`, you can find the logs here.)
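As a side note, per-file totals like those in the table above can be derived from such logs with a small script along the following lines. This is a simplified sketch, not the exact tool we used: it assumes a log produced with something like `strace -f -e trace=read,write -o trace.log processCcd.py ...`, aggregates by file descriptor, and ignores descriptor reuse, which a complete analysis has to resolve by also tracking the `open` calls:

```python
import re
import sys
from collections import defaultdict

# Matches lines such as:  read(3, "..."..., 8192) = 8192
# The return value is the number of bytes actually transferred.
IO_RE = re.compile(r'\b(read|write)\((\d+),.*\)\s*=\s*(\d+)')

totals = defaultdict(lambda: [0, 0])  # fd -> [bytes read, bytes written]
with open(sys.argv[1]) as log:
    for line in log:
        m = IO_RE.search(line)
        if m:
            syscall, fd, nbytes = m.group(1), m.group(2), int(m.group(3))
            totals[fd][0 if syscall == "read" else 1] += nbytes

for fd, (rd, wr) in sorted(totals.items(), key=lambda kv: int(kv[0])):
    print(f"fd {fd}: read {rd / 1e6:.1f} MB, written {wr / 1e6:.1f} MB")
```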
So, all this is for processing a single CCD. If we consider that the HSC dataset we are using comprises a total of 8,500 visits, each composed of 104 CCDs, you may agree with me that we need to understand this behaviour.
I'm happy to provide more details if necessary and to contribute more tests to help understand this and eventually make the stack more file-system friendly.