Memory contiguity and python execution speed

We have a number of algorithms implemented as for loops in python (e.g. the star selectors) that could potentially be done as vectorized numpy expressions. These loops are potentially quite slow, but are necessary because afw tables are not guaranteed to be contiguous in memory, and thus we can’t do the usual (table['value'] == 0) & (table['value2'] > 5). We’re likely to run into some significant performance problems once we start running code like this on large catalogs.

The design of afw tables makes it hard to deal with them at the python level, but always doing deep copies to ensure contiguity has its own performance and memory problems.

One possible solution: a “@needs_contiguity” decorator that checks all the arguments of a function and makes deep copies if necessary to make them contiguous. This doesn’t help if the function needs to update values in-place, but could work well otherwise.

Another possibility would be the ability to use the numpy nditer iterator with afw.tables. The syntax would not be as clean as “proper” vectorized numpy statements, but it should be nearly as fast.

Other suggestions or ideas?

For things that are primitive types you can use

Unfortunately, modifying the catalog (e.g. to set a flag) is a very common use case and we should do nothing to discourage it. I hope a way can be found to allow fast access in Python to non-contiguous data. But until we have that, I suggest:

  • If the catalog is modified in place, then test for non-contiguous and raise an exception if not.
  • If the catalog is ready only, then consider copying it instead of raising (but raising might result in faster code, e.g. if one piece of code passes the catalog to multiple algorithms that copy it).

Asserting and copying are both easy, each requiring only a few lines of simple code. But even so, I guess there is a case for syntactic sugar: it’s a very common need, and using a decorator documents the requirement in a nice way.

I really don’t want to have a system that will do a copy behind my back, because that’s going to make for a lot more silent, lurking speed and memory problems. One of the features of ndarray is that it only copies when you tell it to copy, and that’s a good thing.

I don’t know the afw.table internals well enough to evaluate this myself, but are they not at least “piecewise contiguous”, as it were? Like STL deque? If so, for a piece of code as deep down in our system as afw.table, might it not be worth it to produce an implementation (admittedly, a complex one) that uses piecewise vectorization?

Do we know that this is a problem? For example, are the tables accessed in the starSelector code actually non-continuous, or could they be made contiguous by pre-allocating them (this assumes no deblender)?

In general I realise why loops are a bad idea, but has profiling shown this to be a problem in places that we care about?

I have a few thoughts here, representing some different possible solutions (or non-solutions):

  • The number of records in catalogs only changes at a few discrete points in the the processing. I think it’s almost always fixed after deblending is complete. If necessary, we could deep-copy at this stage to give all downstream stages access to contiguous tables.

  • I don’t understand how nditer could improve performance if it still involves running pure-Python code at every iteration. Is it really true that the overhead in Python loops comes from the loop itself, not the fact that the loop body involves Python code?

  • I think there are some operations that are more readable when expressed as NumPy expressions and some that are not. If performance isn’t actually a concern (@RHL’s question), we should be careful about blindly numpy-izing things, as I think row-based operations are frequently much more intuitive.

  • To @gpdf’s point: our tables are indeed usually piecewise contiguous, but even that isn’t guaranteed. I suppose we could do what you propose via a NumPy array-like object that’s piecewise contiguous, but that does indeed seem like a lot of work.

  • I’ve been wondering (vaguely) about whether we could add filtering methods to our tables that would accept actual SQL WHERE clauses (and do the filtering in C++). That’d obviously be a case of egregious overkill if it was just to solve this problem, but I think it provides some other benefits (filter expressions in config fields, code reuse in and outside the database) that might make it worthwhile, if we can find external libraries that would do most of the work for us.

  • We can always lift slow loops into C++. This should always be considered when profiling reveals a hotspot in Python code.

Or we could reserve the correct number of rows from the very start. We know how many peaks there are, and therefore the maximum size of the catalog.

Part of the problem is there’s no way right now to denote in the code the guarantee of a contiguous table, so any code that is somewhat generic needs to assume the table is non-contiguous and use loops. I think this means that even if we do make contiguous tables after a certain point, we will still need some semantics to make that guarantee clear.

Regarding speed, the argument has also been made before that we should not do something that’s obviously slow just because it’s convenient, since those decisions add up. It would be nice to know how frequently these small loops over all table records appear in python, to guide whether we need a point solution in a few spots or a more global solution.

I agree that would be helpful. Documentation might suffice, and the code to raise if not contiguous is simple and clear. Nonetheless, a decorator that raises if the catalog is not contiguous might add useful clarity.

I don’t have an answer to the first part of your question: my jointcal test uses fake catalogs that are contiguous, since I just created them before passing them to jointcal. I don’t know what sort of catalogs would get passed to it in full production, or whether we could guarantee their contiguity from that point forward.

To the second part of your question: the new astrometrySourceSelector takes about 5 seconds to extract ~1000 sources from a catalog of ~5000. I can reduce that to about 0.4 seconds by replacing one line that is essentially for source in srcCat: any(source.get(stuff)) with reduce(np.somefunc, srcCat.get(stuff)). I haven’t even looked for further speedups, but cProfile tells me a lot of time is being spent in methods that could be easily vectorized over, so there might be another factor of 10 or so hiding in there. But that reduce doesn’t work if the array is non-contiguous because of the numpy expression that’s doing the actual work.

That’s a good answer to the second part; yes, we do have to care. I’m concerned that it’s still 100 microseconds per source (10^5 cycles!), but as you say, there are further opportunities for optimisation.

Further followup: I vectorized the whole thing (not necessarily in the most optimal way, just exactly following the existing code), and it’s now down to about 10ms per catalog. So, a factor of about 500x vectorized vs. for loops. I’m likely creating several numpy temporaries in the process, but they’re almost all bool arrays, so only one byte per source.

This strongly suggests that requiring contiguity here is a _Good Idea_™. Colin’s decorator that excepts immediately if the catalog is not contiguous seems like a good starting point.


So, given this discussion and the massive speed improvement I got plus Russell’s discussion in this community post about catalog.append, do we have a recommendation about how we should manage this sort of thing? It is probably worth a discussion session at the all hands meeting. Each of the proposals has its own set of problems.

The astrometrySourceSelector I just wrote currently has two code branches: one for contiguous and one for non-contiguous catalogs. I’m leaving the latter (very slow) code in place for now so that it at least works on non-contiguous catalogs, while we sort this out.

Furthermore, objectSizeStarSelector (which doesn’t have tests) will raise a lsst::pex::exceptions::RuntimeError: 'Record data is not contiguous in memory.' exception on non-contiguous catalogs because it uses numpy for some comparisons.

Sorry I’ve been silent lately here.

So far I think the most intriguing solution is @price’s:

I think he’s right: we really can know the maximum size of the catalog (and in many contexts, the true size of the catalog) fairly early. If we reserve the right amount of space at that point, everything downstream that operates on the full catalog can assume contiguousness (operations on already-filtered subsets still can’t). We might be able to just invert the expectation, and make contiguousness common enough that most operations can assert it, and we can specialize the few that are run in contexts where that assert would fail. In fact, I wonder if we could get to a point where we don’t even need to have support for noncontiguous catalogs at all in afw.table; that’d obviously be a huge simplification to the library.

I agree. However, we also need to solve the issue of filtered subsets, which are all too common. @parejkoj suggested that we use boolean arrays or indexed arrays instead of catalog subsets, so that we can continue to apply operations to the full catalog. I worry about the inefficiency of working with small subsets of large catalogs, but it clearly is better than what we are doing now. In this context star selectors would return an index array rather than a catalog.

This sounds reasonable to me. It also makes me wonder if it’d be useful to have multiple container classes for Records (that was my original plan back in the early days of afw.table), including perhaps one that guaranteed contiguousness and one that combined that with an index array. Of course, the latter sounds suspiciously like masked_array, which (I hear) is a bit of a disaster, so maybe it’s not such a good idea.