afwTable performance degradation using brackets vs. explicit .get() for fields

erykoff · September 10, 2017, 5:36am

There appears to be a gigantic performance difference between using sourceCat['field'] and sourceCat.get('field') that (I believe) scales with the width of the table.
It seems as though the first call with square brackets might be converting the whole table (though this is just conjecture on my part).
For a tiny table, here is a timing comparison of accessing the elements in a small table:

import lsst.afw.table as afwTable
import numpy as np

schema = afwTable.Schema()
schema.addField('col1',type=np.float32,doc="column 1")
schema.addField('col2',type=np.float32,doc="column 2")
schema.addField('col3',type=np.float32,doc="column 3")

testCat = afwTable.BaseCatalog(schema)
nRow = 10
testCat.table.preallocate(nRow)

for i in xrange(nRow):
    rec=testCat.addNew()
    rec['col1'] = i
    rec['col2'] = i*10
    rec['col3'] = i*100

% timeit test=testCat['col1']
## 2.21 ms
% timeit test=testCat.get('col1')
## 34.1 us

That’s a factor of 60 for this little toy case. When doing a source selection like the following:

gdFlag = np.logical_and.reduce([~sources['flag_pixel_saturated_center'],
                                    ~sources['flag_pixel_interpolated_center'],
                                    ~sources['flag_pixel_edge'],
                                    ~sources['flag_pixel_cr_center'],
                                    ~sources['flag_pixel_bad'],
                                    ~sources['flag_pixel_interpolated_any'],
                                    ~sources['slot_Centroid_flag'],
                                    ~sources['slot_ApFlux_flag'],
                                    sources['deblend_nchild'] == 0,
                                    sources['parent'] == 0,
                                    sources['classification_extendedness'] < 0.5])

vs

gdFlag = np.logical_and.reduce([~sources.get('flag_pixel_saturated_center'),
                                    ~sources.get('flag_pixel_interpolated_center'),
                                    ~sources.get('flag_pixel_edge'),
                                    ~sources.get('flag_pixel_cr_center'),
                                    ~sources.get('flag_pixel_bad'),
                                    ~sources.get('flag_pixel_interpolated_any'),
                                    ~sources.get('slot_Centroid_flag'),
                                    ~sources.get('slot_ApFlux_flag'),
                                    sources.get('deblend_nchild') == 0,
                                    sources.get('parent') == 0,
                                    sources.get('classification_extendedness') < 0.5])

Then the speed difference is a factor of approximately 12000. In one case it takes ~10 seconds (for a fully populated source catalog) and in the other 822 microseconds.
For reference, doing it the way the sourceSelector does (select each row and append one at a time) is somewhere in the middle … from intolerable to tolerable, but not as good as the vectorized selection through numpy.

Notably, for an individual source/row there does not seem to be any performance difference between the [] call and the .get() call.

Relatedly, when looping over a source catalog to set values, if you do:

for i in xrange(nRow):
    tempCat['field'][i] = computed_value

then it will be super-slow, but

for i in xrange(nRow):
    tempCat[i]['field'] = computed_value

then it will be fast because you use the setter on an individual row (fast) vs trying to grab the full column in a numpy array (slow). This is opposite to what one does for a regular numpy recarray.

Finally, there does not seem to be a corresponding fast table-wide .set() call to go with the fast .get() call so at the moment you’re stuck with either looping over the full table (as above), or taking the performance hit.

parejkoj · September 10, 2017, 6:21am

See DM-6782 for your first comment.

Have you tried it using the record Key? Unless something’s changed recently, I’d expect that to be even faster than your sources.get("string") version.

As to your last question, I swore there was a ticket on this, but I can’t seem to find it. Related tickets are DM-5440 and DM-10380. Unless someone else can find it, we should file that request as a ticket (“Add table.set(np.array)” or somesuch).

jbosch · September 10, 2017, 2:40pm

The DM-6782 issue - that Key access is faster than string access - is a known one(and at some level is a feature, not a bug - while string access could be much faster than it is now, Key access will always be fastest).

But __getitem__ behaving differently from get is definitely a previously-unknown issue, and I suspect it’s caused many misdiagnoses of past performance problems. Those two methods should mostly follow the same code path, with the complications being:

__getitem__ has to first check whether the argument is an integer, a slice, or a boolean array, which invokes row-based indexing. I suspect some of this dispatch is happening through pybind11-wrapped overloaded C++ functions, which may be quite a bit slower than a simple if in Python.
Both of these methods are actually defined on the ColumnView class, not the Catalog, and are invoked via __getattr__ forwarding, I think.

I’ve created DM-11871 for this. If you’re in a position to easily run a profiler on the extended example in particular, that’d be quite helpful in estimating how hard this will be to fix, as I bet it’ll be enough for me to guess what’s going wrong.

erykoff · September 10, 2017, 3:45pm

I’ll be able to do a quick profile on Monday at the latest, but if it’s in the C++ functions and not the python layer I don’t know if I’ll be able to say much.

I thought for a second that .get() was read-only, but I was wrong. I can do tempCat.get('field')[:] = array and it is nice and performant. This (to my eye) is awkward, and I certainly wouldn’t have thought to use this syntax.

And looking at @parejkoj’s ticket above, I think that the string look-up, while a bit of a performance hit, is a red herring. The essential performance gain is the replacement of the brackets with .get().

parejkoj · September 10, 2017, 5:31pm

I wonder if the [string] vs. .get(string) had a performance change going from swig to pybind11? Running the field_key_timing.py file I put in DM-6782 now, I get a similar amount of time for the .get(key) version, but a much faster time than before for the [string] version (~200ms in the ticket, ~15ms now). And I’ve added a .get(string) test to it (see the ticket), and it comes out with the same timing as the [string] version: I wonder why you’re seeing a difference in the test you wrote above?

$ ipython field_key_timing.py
Using matplotlib backend: MacOSX
Small schema, string lookup
100 loops, best of 3: 14 ms per loop
Small schema, string-get lookup
100 loops, best of 3: 13.1 ms per loop
Small schema, asKey() lookup
100 loops, best of 3: 4.07 ms per loop

Large schema, string lookup
100 loops, best of 3: 14.7 ms per loop
Small schema, string-get lookup
100 loops, best of 3: 14.8 ms per loop
Large schema, asKey() lookup
100 loops, best of 3: 4.15 ms per loop

I ran your test above and did see a similar difference like you saw, although I got the %timeit caching warning, which might be important. Why would this version behave differently than the one I put in the ticket?

In [3]: % timeit test=testCat['col1']
The slowest run took 163.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 1.3 ms per loop

In [4]: % timeit test=testCat.get('col1')
The slowest run took 4.48 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 18.3 µs per loop

erykoff · September 10, 2017, 6:18pm

Just looked at your test code, which might also point to a clue. Your test does a loop: [x['parent'] for x in src]. Note that the performance of src[i]['field'] is the same as src[i].get('field') as far as I can tell. It’s only when you do the array access src['field'] that you get the giant performance hit in the current version of the stack.

So it may be that the pybind11 upgrade fixed some performance issues for the individual sources, but not for the full array.

Also note that my timing comes out in microseconds vs. milliseconds. It’s still much faster to do the full array access with .get('field') than to loop over individual sources (loops are evil!), but that’s still much faster than the broken ['field'] call.

erykoff · September 10, 2017, 11:30pm

I ran a quick cProfile on the runs, and it was illuminating. With the .get() call, it’s very short and sweet:

In [7]: cProfile.run("test=testCat.get('col1')")
         11 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 baseColumnViewContinued.py:58(__getitem__)
        1    0.000    0.000    0.000    0.000 baseContinued.py:106(__getColumns)
        1    0.000    0.000    0.000    0.000 baseContinued.py:319(__getattr__)
        1    0.000    0.000    0.000    0.000 {built-in method _basicget}
        1    0.000    0.000    0.000    0.000 {built-in method find}
        2    0.000    0.000    0.000    0.000 {getattr}
        1    0.000    0.000    0.000    0.000 {hasattr}
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

With the bracket-based call, it’s very complex, and goes through the asAstropy interface and a whole lot else:

In [6]: cProfile.run("test=testCat['col1']")
         2320 function calls (2269 primitive calls) in 0.004 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.004    0.004 <string>:1(<module>)
       19    0.000    0.000    0.000    0.000 _abcoll.py:548(update)
       43    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
       17    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
       19    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        4    0.000    0.000    0.000    0.000 baseColumnViewContinued.py:58(__getitem__)
        4    0.000    0.000    0.000    0.000 baseContinued.py:106(__getColumns)
        1    0.000    0.000    0.004    0.004 baseContinued.py:112(__getitem__)
        1    0.000    0.000    0.003    0.003 baseContinued.py:225(asAstropy)
        1    0.000    0.000    0.000    0.000 baseContinued.py:319(__getattr__)
        1    0.000    0.000    0.004    0.004 baseContinued.py:331(__str__)
        1    0.000    0.000    0.004    0.004 baseContinued.py:339(__repr__)
        1    0.000    0.000    0.000    0.000 collections.py:121(values)
        1    0.000    0.000    0.000    0.000 collections.py:125(items)
        4    0.000    0.000    0.000    0.000 collections.py:133(itervalues)
        6    0.000    0.000    0.000    0.000 collections.py:194(__reduce__)
       19    0.000    0.000    0.000    0.000 collections.py:50(__init__)
        6    0.000    0.000    0.000    0.000 collections.py:71(__setitem__)
       18    0.000    0.000    0.000    0.000 collections.py:90(__iter__)
        6    0.000    0.000    0.002    0.000 column.py:103(__new__)
        3    0.000    0.000    0.000    0.000 column.py:156(data)
       18    0.000    0.000    0.000    0.000 column.py:160(parent_table)
        9    0.000    0.000    0.000    0.000 column.py:167(parent_table)
        6    0.000    0.000    0.001    0.000 column.py:268(__array_finalize__)
        1    0.000    0.000    0.000    0.000 column.py:31(_auto_names)
       12    0.000    0.000    0.000    0.000 column.py:312(name)
        6    0.000    0.000    0.000    0.000 column.py:319(name)
        6    0.000    0.000    0.000    0.000 column.py:490(unit)
       12    0.000    0.000    0.000    0.000 column.py:501(unit)
        6    0.000    0.000    0.001    0.000 column.py:620(_copy_attrs)
        6    0.000    0.000    0.002    0.000 column.py:695(__new__)
   126/87    0.000    0.000    0.001    0.000 column.py:709(__setattr__)
       13    0.000    0.000    0.000    0.000 configobj.py:553(__getitem__)
        5    0.000    0.000    0.000    0.000 configuration.py:275(__get__)
        5    0.000    0.000    0.000    0.000 configuration.py:380(__call__)
        5    0.000    0.000    0.000    0.000 configuration.py:447(_validate_val)
       10    0.000    0.000    0.000    0.000 configuration.py:481(get_config)
        2    0.000    0.000    0.000    0.000 console.py:120(isatty)
        1    0.000    0.000    0.000    0.000 console.py:146(terminal_size)
        1    0.000    0.000    0.000    0.000 console.py:89(_get_stdout)
    36/24    0.000    0.000    0.001    0.000 copy.py:145(deepcopy)
       12    0.000    0.000    0.000    0.000 copy.py:226(_deepcopy_list)
        6    0.000    0.000    0.000    0.000 copy.py:234(_deepcopy_tuple)
       12    0.000    0.000    0.000    0.000 copy.py:253(_deepcopy_dict)
       36    0.000    0.000    0.000    0.000 copy.py:267(_keep_alive)
        6    0.000    0.000    0.000    0.000 copy.py:306(_reconstruct)
        6    0.000    0.000    0.000    0.000 data_info.py:168(__init__)
       54    0.000    0.000    0.000    0.000 data_info.py:172(<genexpr>)
       27    0.000    0.000    0.000    0.000 data_info.py:174(__get__)
       24    0.000    0.000    0.000    0.000 data_info.py:204(__getattr__)
       36    0.000    0.000    0.000    0.000 data_info.py:226(__setattr__)
        3    0.000    0.000    0.000    0.000 fnmatch.py:69(fnmatchcase)
        3    0.000    0.000    0.000    0.000 metadata.py:378(__get__)
       13    0.000    0.000    0.000    0.000 metadata.py:385(__set__)
        2    0.000    0.000    0.000    0.000 misc.py:38(isiterable)
       15    0.000    0.000    0.000    0.000 np_utils.py:562(fix_column_name)
        4    0.000    0.000    0.000    0.000 pprint.py:120(_get_pprint_size)
        3    0.000    0.000    0.001    0.000 pprint.py:173(_pformat_col)
       30    0.000    0.000    0.000    0.000 pprint.py:22(default_format_func)
       39    0.000    0.000    0.000    0.000 pprint.py:226(<genexpr>)
       39    0.000    0.000    0.000    0.000 pprint.py:250(<genexpr>)
       39    0.000    0.000    0.000    0.000 pprint.py:295(_pformat_col_iter)
        1    0.000    0.000    0.001    0.001 pprint.py:400(_pformat_table)
        4    0.000    0.000    0.000    0.000 pprint.py:490(<genexpr>)
        1    0.000    0.000    0.000    0.000 pprint.py:490(<lambda>)
       48    0.000    0.000    0.000    0.000 pprint.py:535(<genexpr>)
        2    0.000    0.000    0.000    0.000 process.py:161(name)
        2    0.000    0.000    0.000    0.000 process.py:59(current_process)
        3    0.000    0.000    0.000    0.000 re.py:192(compile)
        3    0.000    0.000    0.000    0.000 re.py:230(_compile)
        1    0.000    0.000    0.000    0.000 schemaContinued.py:112(__iter__)
        1    0.000    0.000    0.000    0.000 schemaContinued.py:157(extract)
        1    0.000    0.000    0.000    0.000 six.py:552(itervalues)
        1    0.000    0.000    0.001    0.001 table.py:1048(pformat)
        3    0.000    0.000    0.000    0.000 table.py:109(__setitem__)
        7    0.000    0.000    0.000    0.000 table.py:1288(masked)
        2    0.000    0.000    0.000    0.000 table.py:1297(_set_masked)
        6    0.000    0.000    0.000    0.000 table.py:1328(ColumnClass)
        1    0.000    0.000    0.000    0.000 table.py:133(values)
        1    0.000    0.000    0.002    0.002 table.py:242(__init__)
        2    0.000    0.000    0.000    0.000 table.py:302(<genexpr>)
        1    0.000    0.000    0.000    0.000 table.py:555(_check_names_dtype)
        2    0.000    0.000    0.000    0.000 table.py:568(_set_masked_from_cols)
        4    0.000    0.000    0.000    0.000 table.py:570(<genexpr>)
        1    0.000    0.000    0.000    0.000 table.py:575(<genexpr>)
        1    0.000    0.000    0.002    0.002 table.py:596(_init_from_list)
        2    0.000    0.000    0.000    0.000 table.py:600(<genexpr>)
        3    0.000    0.000    0.000    0.000 table.py:683(_convert_col_for_table)
        1    0.000    0.000    0.000    0.000 table.py:694(_init_from_cols)
        4    0.000    0.000    0.000    0.000 table.py:697(<genexpr>)
        1    0.000    0.000    0.000    0.000 table.py:733(_make_table_from_cols)
        4    0.000    0.000    0.000    0.000 table.py:738(<genexpr>)
        2    0.000    0.000    0.000    0.000 table.py:74(__init__)
        4    0.000    0.000    0.000    0.000 table.py:744(<genexpr>)
        1    0.000    0.000    0.001    0.001 table.py:814(__unicode__)
        1    0.000    0.000    0.001    0.001 table.py:819(__bytes__)
        6    0.000    0.000    0.000    0.000 table.py:87(__getitem__)
        2    0.000    0.000    0.000    0.000 threading.py:1029(getName)
        2    0.000    0.000    0.000    0.000 threading.py:1143(currentThread)
        2    0.000    0.000    0.000    0.000 threading.py:958(name)
        3    0.000    0.000    0.000    0.000 validate.py:1073(is_string)
        5    0.000    0.000    0.000    0.000 validate.py:593(check)
        5    0.000    0.000    0.000    0.000 validate.py:637(_parse_with_caching)
        3    0.000    0.000    0.000    0.000 validate.py:651(_check_value)
        3    0.000    0.000    0.000    0.000 validate.py:746(_is_num_param)
        1    0.000    0.000    0.000    0.000 {_struct.pack}
        1    0.000    0.000    0.000    0.000 {_struct.unpack}
        2    0.000    0.000    0.000    0.000 {all}
        3    0.000    0.000    0.000    0.000 {any}
        4    0.000    0.000    0.000    0.000 {built-in method _basicget}
        1    0.000    0.000    0.004    0.004 {built-in method _getitem_}
        1    0.000    0.000    0.000    0.000 {built-in method find}
        1    0.000    0.000    0.000    0.000 {built-in method forEach}
        3    0.000    0.000    0.000    0.000 {built-in method getAliasMap}
        3    0.000    0.000    0.000    0.000 {built-in method getDoc}
        1    0.000    0.000    0.000    0.000 {built-in method getMetadata}
        3    0.000    0.000    0.000    0.000 {built-in method getName}
       12    0.000    0.000    0.000    0.000 {built-in method getTypeString}
        3    0.000    0.000    0.000    0.000 {built-in method getUnits}
        1    0.000    0.000    0.000    0.000 {built-in method isContiguous}
        3    0.000    0.000    0.000    0.000 {built-in method items}
        6    0.000    0.000    0.000    0.000 {callable}
        1    0.000    0.000    0.000    0.000 {fcntl.ioctl}
      167    0.000    0.000    0.000    0.000 {getattr}
       28    0.000    0.000    0.000    0.000 {hasattr}
      132    0.000    0.000    0.000    0.000 {id}
      275    0.000    0.000    0.000    0.000 {isinstance}
       18    0.000    0.000    0.000    0.000 {issubclass}
        4    0.000    0.000    0.000    0.000 {iter}
      166    0.000    0.000    0.000    0.000 {len}
        6    0.000    0.000    0.000    0.000 {max}
        6    0.000    0.000    0.000    0.000 {method '__reduce_ex__' of 'object' objects}
       53    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        3    0.000    0.000    0.000    0.000 {method 'center' of 'unicode' objects}
        6    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}
        3    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
      118    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'isatty' of 'file' objects}
       19    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
       12    0.000    0.000    0.000    0.000 {method 'iteritems' of 'dict' objects}
       23    0.000    0.000    0.000    0.000 {method 'join' of 'unicode' objects}
        3    0.000    0.000    0.000    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
       15    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
       36    0.000    0.000    0.000    0.000 {method 'rjust' of 'unicode' objects}
        2    0.000    0.000    0.000    0.000 {method 'rsplit' of 'str' objects}
        3    0.000    0.000    0.000    0.000 {method 'rsplit' of 'unicode' objects}
        5    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
       10    0.000    0.000    0.000    0.000 {method 'split' of 'unicode' objects}
       57    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        9    0.000    0.000    0.001    0.000 {method 'view' of 'numpy.ndarray' objects}
        3    0.000    0.000    0.000    0.000 {numpy.core.multiarray.arange}
        6    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        8    0.000    0.000    0.000    0.000 {range}
       27    0.000    0.000    0.000    0.000 {setattr}
        1    0.000    0.000    0.000    0.000 {sum}
        2    0.000    0.000    0.000    0.000 {thread.get_ident}
       12    0.000    0.000    0.000    0.000 {vars}
        4    0.000    0.000    0.000    0.000 {zip}

The calls on individual records are the same whether or not you use .get() or []:

In [9]: cProfile.run("test=testCat[0]['col1']")
         5 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 baseContinued.py:112(__getitem__)
        1    0.000    0.000    0.000    0.000 {built-in method _getitem_}
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

jbosch · September 11, 2017, 5:29pm

Here’s what’s going on:

__getitem__ tries to see if the given argument can be used to index rows instead of columns by calling an overloaded, pybind11-wrapped function inside a try block.
pybind11 tries all of the overloads to see if any of them match. None of them do. This is still all pretty fast, I think, though it might be good to avoid it.
pybind11 composes a “helpful” error message declaring that the function arguments are bad, which includes printing the arguments that were tried and hence invoking __str__ on self. That invokes the new AstroPy-based pretty-printing, etc. This is what’s slow.
We catch the exception and fall back to column access.

I can and will fix this now by explicitly checking for str and Key arguments instead of try/catch when determining whether we want row or column indexing.

But I don’t like the broader problem that you can’t rely on try/except to be a fast test when pybind11 wrappers and slow __str__ are involved. I imagine pybind11 is expecting __str__ to be smaller and more efficient than __repr__, which is not unreasonable - but this is just the same old ambiguity about exactly what Python stringification methods are supposed to do. @pschella, I wonder if we should push for some pybind11 hook for controlling if/how to stringify arguments when building exception strings. Normally we’d like a verbose message that includes the argument that was passed, but this is much too verbose.

pschella · September 11, 2017, 5:58pm

Yes, we should definitely push this upstream. Would it be sufficient to just get upstream to default to __repr__?

timj · September 11, 2017, 6:13pm

I think there’s a risk that __repr__ could be more verbose than __str__ if you interpret it as "what you use to represent the object such that it can be reconstructed using eval()". Will you be okay in this case because __repr__ of pybind11 objects is always opaque and short?

jbosch · September 11, 2017, 6:19pm

Sadly, since there’s no real recommendation from the Python community on __str__ vs __repr__ on the string-length axis, I don’t think always using either in exception formatting can be safe.