Observations on I/O activity induced by ingestImages.py and processCcd.py

FabioHernandez · August 3, 2017, 3:31pm

@boutigny has been recently and repeatedly processing HSC data using the LSST stack on the batch farm of CC-IN2P3. Both the raw HSC images and the resulting reduced images are stored in our GPFS cluster.

During those processing periods we have observed low CPU efficiency (i.e. the ratio CPU time/wall time) of most of @boutigny’s jobs which we think could be a consequence of GPFS client cache thrashing. @boutigny, L. Tortay (GPFS expert) and myself decided to investigate more. Our aim is to understand what I/O operations are induced by some command line tasks of the stack. What follows is the report of what we have observed so far.

Our simplest possible test is just two steps:

ingest a single raw image (one CCD) using ingestImage.py
use processCcd.py for processing a single CCD image

For this test we are using w_2017_29 running on CentOS 7.

Image ingestion

The command of our test is:

$ ingestImages.py data ./raw/HSCA01151351.fits --mode link

As a result of this, among other things, a file registry.sqlite3 is created in the data directory. Both the data directory and the raw directory reside on GPFS. In the process of populating this file, the sqlite3 library creates some temporary files (with extensions -journal or -wal) in the same directory as the final destination file, i.e. data. It repeatedly acquires and releases locks on those temporary files. When the population of the registry file is finished, the temporary file is renamed to its final name registry.sqlite3 .

Locking files or portion of files in a shared file system is a potentially costly operation. According to the SQLite documentation, there is a way to instruct the library where to create temporary files. It didn’t work in my tests: setting the values of one of the environmental variables SQLITE_TMPDIR, TMPDIR or TMP had no effect when using a test program linking against the sqlite3 shared library shipped with the stack.

I would like to suggest the command line tasks of the stack to create any temporary file in the local storage of the compute node (either local scratch disk or even RAM disk) and to copy those files back to their final destination (i.e. the data directory in GPFS in this particular case) when they are no longer needed. The command line tasks could look for the value of the POSIX TMPDIR variable for the location to store those temporary files. Processing sites would then set the TMPDIR variable to the appropriate location for each job to use for scratch storage.

Process CCD

The command for this test is:

$ processCcd.py input --output output --id visit=38080 ccd=23

There is nothing special (to our knowledge) with this particular visit and this particular CCD. The size of the input image file of CCD 23 is 17 MB.

At the end of the process several files are created under the output directory:

$ tree output
output
├── 01318
│   └── HSC-I
│       ├── corr
│       │   ├── BKGD-0038080-023.fits
│       │   └── CORR-0038080-023.fits
│       ├── output
│       │   ├── ICSRC-0038080-023.fits
│       │   ├── SRC-0038080-023.fits
│       │   ├── SRCMATCH-0038080-023.fits
│       │   └── SRCMATCHFULL-0038080-023.fits
│       ├── processCcd_metadata
│       │   └── 0038080-023.boost
│       └── thumbs
│           ├── flattened-0038080-023.png
│           └── oss-0038080-023.png
├── config
│   ├── packages.pickle
│   └── processCcd.py
├── repositoryCfg.yaml
└── schema
    ├── icSrc.fits
    └── src.fits

8 directories, 14 files

As in the previous step, we collected the I/O activity using strace and then analysed its output. In the table below you can find the summary of the activity related to some of the files generated in this step. The SIZE column is the final size of the generated file, the READ column is the amount of data read using the read system call when populating the file and analogously, the WRITE column is the amount of data written by the write system call. All the numbers are in MBytes:

FILE                                                               SIZE (MB)  READ (MB) WRITE (MB)
output/01318/HSC-I/output/ICSRC-0038080-023.fits                        1        265          3
output/01318/HSC-I/output/SRC-0038080-023.fits                         12       2299         24
output/01318/HSC-I/output/SRCMATCH-0038080-023.fits                     0          0          0
output/01318/HSC-I/output/SRCMATCHFULL-0038080-023.fits                 0         47          1
output/01318/HSC-I/corr/BKGD-0038080-023.fits                           0          1          0
output/01318/HSC-I/corr/CORR-0038080-023.fits                          98         13         98
output/schema/icSrc.fits                                                0         15          0
output/schema/src.fits

Notice that for instance, for generating the file SRC-0038080-023.fits which has a final size of 12 MB (12876480 bytes), the process read 2299 MB, that is, 191 times its final size. In the same way, writing 1MB to the file ICSRC-0038080-023.fits required reading 265 MB from it, or 265 times its size.

This looks really suspicious and probably unintended. If we look in detail what is happening at the file system level, we can see a pattern:

write some FITS key-value pairs in the first HDU header (11520 bytes)
set the file position to 0
read all the contents of the file written so far
write some data to the file (typically a FITS HDU, that is, 2880 bytes)
set the file position to 0
read all the contents of the file written so far
write some data to the file (typically a FITS HDU, that is, 2880 bytes)
set the file position to 0
read all the contents of the file written so far
and so on…

It is not clear for me why it is necessary to re-read the whole file before each write operation. But if this is the intended behavior, could this be done also in a scratch area local to the compute node and copy the result to the final destination when appropriate? Given the sizes of the generated files, I don’t think we would be limited by local storage capacity at the compute node.

(If you want to see in detail the I/O activity of those 2 files, as reported by strace, you can find the logs here.)

So, all this is for processing a single CCD. If we consider that the HSC dataset we are using has a total of 8.500 visits each composed of 104 CCDs, you may agree with me that we need to understand this.

I’m happy to provide more details if necessary and to contribute with more tests to understand and eventually help making the stack more file system friendly.

ktl · August 3, 2017, 3:46pm

The temporary files that SQLite places in a separate directory are for things like materialized subqueries, temporary tables, etc. The -journal and -wal files, on the other hand, are always in the same directory as the database file itself. I think they have to have many of the same characteristics, particularly shareability, in order to enable multiprocess access. These are not single-process temporary files and so cannot be placed in node-local or process-local storage.

As to the FITS file I/O, I’m not sure whether this is from cfitsio or our usage of it. That would have to be investigated further.

jbosch · August 3, 2017, 3:48pm

Reading the contents of the file we’ve just written is definitely not intentional - could that just be us seeking back to the end of the file? If so, it looks like we’re expecting I/O library to reduce “set position to zero and then seek to the end” to a no-op, but I don’t know if yet if that’s something internal to CFITSIO or something we’re telling it do do.

jbosch · August 3, 2017, 3:56pm

Looking a bit closer (sorry for the double post), we have a few calls to CFITSIO’s fits_move_abs_hdu and fits_get_num_hdus between writing most HDUs. I had previously thought that CFITSIO was tracking in its in-memory data structure what HDU we were on and how many there were, and it didn’t occur to me that moving to a new HDU without reading or writing anything would incur any I/O. I bet that’s what needs to be fixed.

ktl · August 3, 2017, 3:56pm

P.S. The multiprocess sharing nature of the .sqlite3, -journal, and -wal files is precisely why locking is necessary at all. We could try to reduce the number of transactions we perform against the registry in order to reduce the number of locking calls.

FabioHernandez · August 3, 2017, 3:57pm

It may be very well some CFITSIO feature. The pattern is:

8756  11:25:42.397236 lseek(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, 0, SEEK_SET) = 0 <0.000011>
8756  11:25:42.397272 read(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, "SIMPLE  =                    T /"..., 1048576) = 14400 <0.000018>
8756  11:25:42.397324 write(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, "TCCLS23 = 'Scalar  '           /"..., 2880) = 2880 <0.000016>
8756  11:25:42.397370 lseek(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, 0, SEEK_SET) = 0 <0.000010>
8756  11:25:42.397406 read(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, "SIMPLE  =                    T /"..., 1048576) = 17280 <0.000012>
8756  11:25:42.397446 lseek(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, 17280, SEEK_SET) = 17280 <0.000011>
8756  11:25:42.397492 lseek(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, 0, SEEK_SET) = 0 <0.000011>
8756  11:25:42.397527 read(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, "SIMPLE  =                    T /"..., 1048576) = 17280 <0.000013>
8756  11:25:42.397568 lseek(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, 0, SEEK_SET) = 0 <0.000011>
8756  11:25:42.397604 read(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, "SIMPLE  =                    T /"..., 1048576) = 17280 <0.000012>
8756  11:25:42.397644 write(8</sps/lsst/dev/fabio/tests/processCcd/output/01318/HSC-I/output/ICSRC-0038080-023.fitsg7sk5_kc>, "TFLAG16 = 'base_SdssShape_flag_s"..., 2880) = 2880 <0.000014>

timj · August 3, 2017, 3:57pm

I’m looking forward to testing this with HDF5.

FabioHernandez · August 3, 2017, 4:00pm

Yes, I understand that. I see benefits of having a registry associated to the directory where data is located. But then I wonder if sqlite is the good technology for populating a shared registry in parallel: I we have 1000 compute nodes populating a single registry I would rather use a central RDMBS and when the process is finished to copy the data back to a sqlite file which can then be used read-only.

ktl · August 3, 2017, 4:01pm

The SQLite registry was always intended to only be one implementation of a registry. A shared central RDBMS server has always been considered to be another alternative but has not yet been implemented. (Edited to add: Except that I forgot that it has been implemented by @price for PostgreSQL.)

MichelleGower · August 3, 2017, 4:17pm

If this implemented in command-line tasks, it should be easily turned off (prefer via command-line or config as opposed to unsetting/modifying system set TMPDIR) in the cases where a supervisory/processing framework is responsible for placing input files into the correct locations and saving output files to post-job locations. Also if done inside command-line tasks and assuming enough local disk space, with a simple implementation there will be extra file movement that can be avoided via the supervisory/processing framework (e.g., output of one command-line task moved back to shared location, next command-line task in same compute job must recopy the file back to local disk for its input or be smart enough to avoid this)

FabioHernandez · August 3, 2017, 4:38pm

Although this may very well work when we will be in steady state, the supervisor would need to integrate all the data placement knowledge which is in the command line tasks, in order to prepare the execution environment for a particular job. This is possible, but I think those decisions (e.g. where files need to be placed in the hierarchy, what are all the required input files for processing a particular file) may still change until operations, and even after.

I’m not worried about local storage space. If you see the individual file sizes, I don’t expect the local storage capacity would be the limiting factor. It is maybe possible that streaming the files from the shared networked file system to the local file system is cheaper than doing the small I/O operations directly on the networked file system. But I agree that it is desirable to avoid copying the files back and forth unnecessarily.

price · August 3, 2017, 4:40pm

For details on using a PostgreSQL registry, see DM-10267 and LDM-463.

jbosch · August 3, 2017, 4:44pm

Solving this problem programmatically is an explicit goal of SuperTask.

FabioHernandez · August 10, 2017, 8:05am

For your information, these observations are also in the form of a technical note DMTN-053 available at https://dmtn-053.lsst.io.