I am currently migrating a workflow from the RSP Jupyter portal to batch processing at NERSC. I have run into the issue that the TAP service has rate limits, so my workflow (which accesses various Object, Visit, etc. tables), multiplied by N jobs, immediately gets an error for overloading TAP. I had understood that the point of having the Rubin data and the compute right next to each other (i.e. on the RSP and at NERSC) was to mitigate the effects of data transfer over the internet and to avoid these kinds of strong rate limits.
Should I rewrite everything to work with the Butler? Does that behave better, since it accesses the data on NERSC directly rather than going through a portal? I have looked into it, and the Butler seems much less efficient than TAP: you often have to download whole columns and do the selection client-side rather than server-side. Another point, beyond the inefficiency of loading extra data into memory, is that reproducing TAP-style ADQL queries with the Butler requires much more code. I understand the Butler has access to the tables/catalogues, so are there plans to add ADQL queries as an option?
As a final note, it was very jarring when migrating from the RSP to NERSC (and presumably to any other IDAC) that the lsst stack is different on the RSP than anywhere else. I loaded the exact same version (v29.2.0) on the RSP and at NERSC, and at NERSC I was unable to do from lsst.rsp import get_tap_service. If there are going to be RSP-specific tools, I think those should live in their own package rather than being added into the lsst stack sometimes.
All of this is conditioned on me being very new to all this, so perhaps I am missing something and my above points (except the lsst.rsp thing) are moot.
Hi @plazas, I think my username is just connorstone. I had 10 jobs running simultaneously, so there's a good chance the TAP service got 10 calls within a few milliseconds of each other. Though I intend to scale up quite a bit (maybe hundreds or low thousands for now).
Overload protection is not the only reason rate (and other resource) limits exist. Other reasons include maintaining fairness, so that one user does not consume resources to the detriment of other users, and incentivizing better engineering choices, both of which are very relevant here. This is why every public API service (like GitHub) applies rate limits.
There is a limit of 12 simultaneous queries in flight for the Qserv-backed TAP service. You can see the current limit by going to data.lsst.cloud → your username → Settings → Quota. There is also an overall API rate limit for TAP (also listed on your quotas page), currently 200 API requests per minute. However, this does not mean 200 TAP queries: depending on which client you are using and how you are using it, it may be making far more requests on your behalf. For example, according to our logs you hit the 200/minute limit while running only 24 queries a minute, likely because your client (pyvo, for example) was polling our service under the hood every second to check whether each query was complete.
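As a rough illustration (not a drop-in fix), a client can submit the async job itself and poll at a gentler cadence instead of relying on the default once-per-second polling. This sketch assumes an authenticated pyvo TAPService named service and uses a DP0.2-style table name as a placeholder:

```python
import time

# Sketch only: submit an async TAP job and poll its status every 30 s
# rather than every second. `service` is an authenticated pyvo TAPService.
job = service.submit_job("SELECT TOP 10 * FROM dp02_dc2_catalogs.Object")
job.run()

while job.phase in ("PENDING", "QUEUED", "EXECUTING"):
    time.sleep(30)  # each .phase access is one status request

results = job.fetch_result()
job.delete()  # clean up the server-side job record
```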
We're going to raise these limits somewhat during today's Patch, and we'll do some more investigation into how we can raise them more aggressively in the future. We will also look into contributing an improvement to pyvo's behavior under these conditions.
To be clear, lsst.rsp is not part of the LSST Science Pipelines “stack”; it is specifically part of the capabilities of the Rubin Science Platform, so it works on IDACs that deploy the Rubin Science Platform but not at NERSC. To access the API in a site-agnostic manner, you can use a personal token to access our services. There is an example of how to do this with TOPCAT in our documentation; we will add a pyvo example, since that seems like it would be helpful.
Hi @frossie, thanks for the explanation! I understand that if the data is shared then there need to be some limitations to ensure everyone can make use of it. I was just surprised that running 10 jobs making TAP queries had me hitting the wall immediately. It's great to hear the limits will be raised soon. Still, I see in the tutorials that TAP is the recommended way to access the catalogues, and a web API with rate limits sitting between the compute and the data seems to go against the LSST philosophy of putting the two together.
I ultimately ended up converting all my TAP queries into butler queries. This feels less efficient, since I am pulling whole columns from a tract in order to grab a single row for the object I'm interested in for each of my parallel jobs. But with the butler queries I am not running into any limits, I think because the butler accesses files on the cluster directly and is therefore subject only to the cluster's own file-access restrictions. In the future I might be able to adjust my workflow to make individual large calls to the TAP service (the "better engineering choices" you mentioned), for instance by batching targets as in the sketch below. However, this turns the parallelization power of a cluster into a serial process, so it isn't conducive to truly scaling up. There are also some queries that I'm not sure I could straightforwardly serialize (at least not in a way that would result in a more efficient query than the butler style of downloading whole columns).
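To spell out the batching idea (a sketch only; the table and column names are DP0.2-style placeholders that would need adjusting, and service is an authenticated pyvo TAPService):

```python
# Sketch: one batched ADQL call instead of one TAP query per object.
# `object_ids` would be the targets a single job is responsible for.
object_ids = [1250953961339360185, 1252220598734556212]  # hypothetical ids
id_list = ", ".join(str(i) for i in object_ids)
query = f"""
SELECT objectId, coord_ra, coord_dec
FROM dp02_dc2_catalogs.Object
WHERE objectId IN ({id_list})
"""
results = service.search(query).to_table()
```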
As for the lsst stack, perhaps I don't know exactly what the “stack” is. Still, when I run import lsst on two machines and I have the same version of lsst on each, I have a (I think natural) expectation that I will find the same modules on both.
I was able to get site-agnostic TAP access working at NERSC with some great help from @heather999, who pointed me to a comment from @mwv on the Rubin slack. Ultimately, it boiled down to this:
```python
from pyvo.dal import TAPService
import requests

# token_file holds the RSP access token (see the link below)
with open(token_file, "r") as f:
    token_str = f.readline().strip()  # strip the trailing newline

# Attach the token to every request as a bearer token
session = requests.Session()
session.headers["Authorization"] = f"Bearer {token_str}"

RSP_TAP_SERVICE = "https://data.lsst.cloud/api/tap"
service = TAPService(RSP_TAP_SERVICE, session=session)
assert service is not None
```
Once I had the token (following the advice here: Creating user tokens — Rubin Science Platform), everything worked identically to the RSP, which was very nice. Though I guess pyvo is a bit aggressive with its API calls if it is hitting these rate limits. Perhaps an exponential back-off scheme could prevent it from overwhelming the service?
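Something like this, as a rough client-side sketch (it retries whole queries on service errors; a real version should check for an HTTP 429/503 response before backing off):

```python
import time
from pyvo.dal import DALServiceError

def search_with_backoff(service, query, max_tries=5, base_delay=2.0):
    """Run a TAP query, retrying with exponential back-off on errors."""
    for attempt in range(max_tries):
        try:
            return service.search(query)
        except DALServiceError:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
```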
Anyway, at this point I think my question has been answered. @frossie, you made it clear how TAP is expected to behave and why. I moved to butler-only for this instance, and in the future I can be mindful of the TAP behaviour and try to serialize my queries rather than relying on "dumb parallelization". I'll mark your comment as the solution to my question. The extra bit about the "stack" is perhaps just growing pains that will get ironed out over time.
I'm pretty sure you can retrieve subsets of the columns via the Butler, just as you could with TAP. It might depend on exactly which tables you're querying, I suppose, but you should be able to apply the same kinds of constraints you had with TAP.
Here is one of the CST tutorial notebooks with both TAP and Butler queries, including an example limiting the columns returned from the butler query. The pattern looks roughly like the sketch below.
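(A sketch only; the repo label, collection, dataset type, and data id values are DP0.2-style placeholders.)

```python
from lsst.daf.butler import Butler

# Repo label and collection are placeholders; use your site's values.
butler = Butler("dp02", collections="2.2i/runs/DP0.2")

# Fetch only two columns from one tract's object table instead of
# pulling the whole table.
table = butler.get(
    "objectTable_tract",
    dataId={"tract": 3828, "skymap": "DC2"},
    parameters={"columns": ["coord_ra", "coord_dec"]},
)
```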
Hi @ljones, that is a fair point: there are some tables where one can restrict the returned data. But I think that only works for tables in the time domain (like the Source table). I was searching the Object table, where the visit.timespan OVERLAPS :timespan constraint wouldn't do anything. Even in that example from tutorial 201.2, they get all the sources from each visit that overlaps the requested timespan. And note that the position constraint visit_detector_region.region OVERLAPS POINT(:ra, :dec) gets the whole visit image if it intersects the provided point. This is different from a cone search, which returns the sources (rather than full images) within some radius of the search point.
If I want to reproduce a cone search, I first need to grab every object in every image that intersects the cone center (via the butler), then do the cone search manually with some numpy boolean logic (or the like); see the sketch below. At least, that's for the Source table. For the Object table, I would need to grab every tract that intersects the point (the tract regions are non-exclusive, so there can be multiple); this gives me every object in all those tracts, and then I can perform my cone search manually. If the cone is not very small, I may also need some extra logic to grab more tracts to make sure I cover the whole cone and not just the center point.
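Concretely, the client-side half ends up looking something like this (a sketch; table is assumed to hold DP0.2-style coord_ra/coord_dec columns in degrees, pulled from the butler for every overlapping tract, and the target position and radius are hypothetical):

```python
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

# Manual cone search over a butler-fetched object table (sketch).
center = SkyCoord(55.75 * u.deg, -32.27 * u.deg)  # hypothetical target
coords = SkyCoord(np.asarray(table["coord_ra"]) * u.deg,
                  np.asarray(table["coord_dec"]) * u.deg)
in_cone = center.separation(coords) < 1.0 * u.arcmin  # boolean mask
matches = table[in_cone]
```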
You might try opening a new question describing the specific thing you're trying to do and asking for the best way to address it with the Butler. I am not a butler expert; I just heard "retrieve subsets of columns" and thought it sounded like retrieving some columns from a butler table. It sounds like your question is actually different, more along the lines of: how to retrieve specific rows from the Object table, then join to the Source table, and then retrieve images? (I don't know, so I'm going to see my unhelpful self out, but that's why opening a new topic might be useful.)
There is no global lsst to import. lsst is a namespace, not a single package that contains all the software we've written. Almost all of our software installs into the lsst hierarchy, but what is available to you depends on what has been installed. lsst.rsp is completely distinct from lsst.daf.butler, and having one installed in no way implies that the other is available. We deliberately share the top-level namespace to avoid confusion over naming, but there is no pip/conda install lsst that will guarantee you get everything.
Hi @timj, thanks for the explanation; I understand more of how the lsst namespace works now. My comment was just that this is very far from my expectations when working with a Python package. I have basically no understanding of how things work behind the scenes, but it would be really great if it were possible to pip/conda install lsst and get the same functionality everywhere. If that meant making rsp an independent package, I think it would be worth it to reduce confusion and follow the "principle of least astonishment".
The Rubin Science Platform is distinct from the LSST Science Pipelines, but some code shares the same namespace. If you are at NERSC and aren't using the RSP, what you have available is entirely down to NERSC. If you can use pip yourself, then pip install --user lsst-rsp might work for you, but we can't make NERSC install everything. For DR1 we will have VO registry entries for the Rubin services, so pyvo should be able to find the services automatically (you will still have to provide the API token).
As for the Butler question: yes, wanting to know the details of one object via the butler is laborious. How are you deciding which objects are of interest? The only way to do this efficiently is to work out multiple objects of interest at a time and then query for them together. If you can shard by tract, you can still use the butler to pull multiple results from a single butler.get, along the lines of the sketch below.
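(Sketch only; it assumes an initialized Butler as in the earlier snippet, the dataset type, skymap, and ids are DP0.2-style placeholders, and I am assuming objectId is the DataFrame index for this dataset type.)

```python
from collections import defaultdict

# Group target objects by tract so each tract's table is fetched once,
# not once per object.
targets = [(3828, 1250953961339360185), (3828, 1250953961339360200),
           (3829, 1252220598734556212)]  # hypothetical (tract, objectId)
by_tract = defaultdict(list)
for tract, object_id in targets:
    by_tract[tract].append(object_id)

rows = {}
for tract, ids in by_tract.items():
    table = butler.get(
        "objectTable_tract",
        dataId={"tract": tract, "skymap": "DC2"},
        parameters={"columns": ["coord_ra", "coord_dec"]},
    )
    # objectId is the table index here (an assumption; check your schema).
    rows[tract] = table.loc[table.index.intersection(ids)]
```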
Hi @timj, I had not realized that the lsst-rsp package made sense to install anywhere other than on the RSP: https://pypi.org/project/lsst-rsp/
We could start including it at NERSC; I just want to make sure that generally makes sense. Would this mean we could more naturally use get_tap_service, as folks do on the RSP? I guess I'm just worried about lsst-rsp containing functions that won't work when we're at NERSC.
I think it's only of interest for that specific API, but I really don't know whether that API assumes the RSP is available for it to work (i.e. whether it needs tokens to be in the environment in a specific place) or whether it is simply hiding some endpoint information. Maybe @adam knows.