Notes from NCSA/AP discussion at UW Oct. 18 and 19

Attendees: @ctslater, @mtpatter, @KSK, Felipe M., @jalt, Rahul Biswas (partial), @connolly (partial)

Introduction to thoughts on L1 prompt processing system (Felipe):
James Parsons is doing the work to support the prompt processing system.

Production prompt processing will require the following steps:

  1. Launch jobs via condor
  2. Utilities developed to support production processing will figure out what the pipeline needs (KSK: I don’t think there is a concrete plan for how this is done) and cache it to local disk
  3. Spawn the processing job on the local resources, running on the locally cached data using a butler pointing to the local repository (see the sketch after this list)
  4. Collect the data products (KSK: I’m not clear whether this is the job of the prompt pipelines or of an afterburner)
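
To make steps 2 and 3 concrete, here is a minimal sketch assuming the Gen2 Butler from lsst.daf.persistence; the staging helper, repository path, and data id are all placeholders, since the production utilities have not been designed yet:

```python
# Sketch only: stage_inputs() is a hypothetical stand-in for the (not yet designed)
# production utility that copies data out of the data backbone; the path and the
# data id are placeholders.
from lsst.daf.persistence import Butler

local_repo = "/scratch/job1234/repo"            # hypothetical local cache location
data_ids = [dict(visit=411420, ccdnum=25)]      # hypothetical DECam data id

# Step 2: pre-cache the inputs the pipeline will need.
# stage_inputs(data_ids, dest=local_repo)

# Step 3: the pipeline only ever sees a butler on the locally cached repository.
butler = Butler(local_repo)
for data_id in data_ids:
    raw = butler.get("raw", dataId=data_id)     # reads from the local cache only
    # ... run the prompt pipeline on the exposure ...
```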

KSK: How do we do this in development? I.e. I don’t want to have to do a pre-caching step if I’m just testing things out, or running on a single node.
Felipe: We will absolutely attempt to make the development environment as close to the production environment as possible. This will mean needing a butler that knows how to get data from the data backbone (KSK: we went back and forth on the butler reading data from the data backbone, but I think this is where we ended up. Felipe, correct me if I’m wrong).

Introduction to the L1 Prompt pipelines (Simon):
One of the aspects that came out of the interface discussions is that there are three distinct kinds of information that flow out of the pipelines:

  1. First class data products — i.e. those defined in the DPDD
  2. Internal data products — Useful internally to science pipelines: e.g. backgrounds, sources. These are typically considered ephemeral.
  3. Data that needs to be broadcast outside science pipelines — logs, QA/QC information, possibly information fed back to OCS, etc.

A concrete example of the third type of data is performance statistics: e.g. timing, memory, network, I/O, and CPU usage. We have been handling this with decorators on methods, but if we need to continue to do that, we’ll need a) more decorators and b) to come up with a standard for how pipeline developers should be decorating their code. The other option is to have external instrumentation to measure performance.
** Can we measure performance at the level we want from outside the tasks or do they need to be instrumented from within? (Felipe)
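
For concreteness, a sketch of the decorator approach; the decorator name is invented, and it assumes the decorated method belongs to a Task-like object with a `log` attribute:

```python
import functools
import resource
import time

def record_performance(method):
    """Hypothetical decorator: log wall-clock time and peak RSS for a task method."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        start = time.time()
        result = method(self, *args, **kwargs)
        elapsed = time.time() - start
        # ru_maxrss is reported in kB on Linux.
        peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        self.log.info("%s: %.2f s elapsed, peak RSS %d kB",
                      method.__name__, elapsed, peak_rss)
        return result
    return wrapper
```

The alternative raised above is external instrumentation, which avoids touching pipeline code but gives up per-method granularity.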

The third type of data is not completely enumerated, and there is no process for enumerating it; it may not even be possible to enumerate it fully. This suggests that we need two things (or possibly just one of them):

  1. An ability to be flexible about what quantities are persisted by the prompt pipelines: e.g. metadata to logs, intermediate data products that could be used in further analysis (a sketch of one way to do this follows this list)
  2. An analysis pipeline that would operate in parallel to produce other outputs useful outside science pipelines, with the code to be written by the AP team. These would be values too costly to compute in the prompt pipelines themselves (computing them there could cause the pipelines to miss their deadlines).
    ** Figure out how we plan this part (@gpdf?; @mjuric?)
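
As an illustration of the first item, a sketch of a config-controlled optional output using lsst.pex.config; the config class and field name are invented, though `calexpBackground` is an existing dataset type:

```python
# Sketch only: the config class and field name are hypothetical.
from lsst.pex.config import Config, Field

class PromptOutputConfig(Config):
    doWriteBackground = Field(dtype=bool, default=False,
                              doc="Persist the subtracted background via the butler?")

def write_optional_outputs(butler, config, background, data_id):
    """Persist only the intermediates that the configuration asks for."""
    if config.doWriteBackground:
        butler.put(background, "calexpBackground", dataId=data_id)
```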

Similarly, there may be some information we need from the OCS:

  1. Information required by the processing. This information is attached to an exposure, and the preferred path for moving it into the processing seems to be to attach it to the exposure metadata (see the sketch after this list).
  2. Information not necessarily associated with a visit: e.g. slews, dome open, dome closed, which visits are scheduled for the next 2 hrs. This needs to be made available, but I don’t know that there is a spec for it.
    ** Write tickets to request OCS information (@KSK; @connolly? ).
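
A sketch of the first path, attaching OCS-provided values to exposure metadata with `lsst.daf.base.PropertyList`; the keyword names are invented:

```python
# Sketch only: keyword names are hypothetical; the real set would need a spec.
from lsst.daf.base import PropertyList

ocs_info = PropertyList()
ocs_info.set("DOMESTAT", "open")      # dome state at the time of the exposure
ocs_info.set("SCHEDNEXT", 12)         # e.g. number of visits scheduled in the next 2 hrs

# In the pipeline these values would be merged into the exposure's existing
# metadata, e.g. via exposure.getMetadata().
```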

Finally, we note that the solar system ephemeris calculations needed for source association can be done in parallel and may benefit from doing so (i.e. we may have lots of predictions to make in a short time).
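
A sketch of what that parallelism could look like in Python; `predict_position` is a placeholder for the real ephemeris code:

```python
import functools
from concurrent.futures import ProcessPoolExecutor

def predict_position(orbit, mjd):
    """Placeholder for the real ephemeris calculation: return a predicted (ra, dec)."""
    raise NotImplementedError

def predict_all(orbits, mjd, max_workers=8):
    # Each prediction is independent of the others, so they can be farmed out
    # to a pool of worker processes.
    predict_at_mjd = functools.partial(predict_position, mjd=mjd)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_at_mjd, orbits))
```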

Data I/O abstraction:
As alluded to earlier, there was a lot of back and forth about how data will be retrieved from the data backbone.
The AP team had assumed very strongly that we would be able to use the butler to get/put things to and from the data backbone, but Felipe noted that the plan in production was to use other tools to fetch the necessary data, pre-cache it to local POSIX storage, and hand the pipeline tasks a butler instantiated on that local repository.

Felipe noted that the need for separate tools for moving data to/from the data backbone was driven by robustness, scalability, and reliability concerns. Simon and Colin pointed out that the Butler is an API meant as an abstraction over any underlying storage technology, and that any robust tools could be plugged into the Butler as a backend. There was some pushback on this notion, since the tool developers would also have to implement the Butler pieces any time the underlying technology changed.

The AP team thought it would be very useful to have the butler be able to talk to the data backbone for development. Felipe said that would be a worthy goal.

Some of the confusion seems to stem from conflating the Butler as an API with its backends. Part of this is because the only backend currently implemented is a POSIX filesystem; it was unclear whether that is because we only ever expect to have a POSIX backend.
** Follow up with @ktl about having a project policy that there will be an I/O abstraction API with access to the “data backbone” (@KSK)
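
To make the API-versus-backend distinction concrete, here is a purely conceptual sketch (this is not the actual Butler plugin interface): the get/put calls the pipelines see stay fixed, while the storage implementation behind them can vary.

```python
# Conceptual sketch only; class and method names are invented for illustration.
from abc import ABC, abstractmethod

class Storage(ABC):
    """The abstraction the pipelines would code against."""

    @abstractmethod
    def get(self, dataset_type, data_id):
        ...

    @abstractmethod
    def put(self, obj, dataset_type, data_id):
        ...

class PosixStorage(Storage):
    """Analogue of today's only implemented backend: a local POSIX repository."""

class DataBackboneStorage(Storage):
    """Hypothetical backend wrapping NCSA's data-movement tools."""
```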

Alert distribution (with Jason Alt):
Jason: What is the interface between Prompt Processing and Alert Distribution?
Simon: I’m hoping this turns out to be a “butler.put” that talks directly to the Kafka server.
Felipe: It may be a problem for the worker nodes to talk to an external service.
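
For reference, the direct-to-Kafka path Simon describes would look roughly like the following with the confluent-kafka Python client; the broker address, topic name, and payload are placeholders:

```python
# Sketch only: broker, topic, and payload are placeholders.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka.example.org:9092"})
serialized_alert = b"..."                    # an Avro-serialized alert packet in practice
producer.produce("alerts", value=serialized_alert)
producer.flush()                             # block until the broker has the message
```

Whether the worker nodes are allowed to open connections to such an external service is exactly the concern Felipe raises.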

Maria would like to have medium/large flavors with large attached ephemeral storage (1TB) available in Nebula.
**Jason Alt will ping @daues to see if we can get flavors of instances with attached ephemeral storage. (@jalt )

Maria’s estimate for the Kafka cluster is that we need something like 3x8 core nodes with 1TB fast attached storage.
Jason: Don’t skimp on hardware; providing more compute is less expensive than the time and effort it takes to try to fit things into less space.

Prototype workflow (Felipe):
Currently coded up in Python, calling the command-line tasks in sequence: ingest and processCcd.py. Currently using DECam data. There is no attempt to be scientifically valid at the moment, so some blocks are mocked; we just want an end-to-end pipeline to figure out where pieces are missing.
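
A rough sketch of that sequence (not Felipe's actual code), using the existing DECam ingest and processCcd command-line tasks; the paths and data id are placeholders:

```python
# Sketch only: paths and the data id are placeholders.
import glob
import subprocess

repo = "/data/prototype/repo"                         # hypothetical local repository
raw_files = glob.glob("/data/prototype/raw/*.fits.fz")

# Ingest the raw DECam frames into the repository.
subprocess.run(["ingestImagesDecam.py", repo, *raw_files, "--mode", "link"], check=True)

# Run single-frame processing on one CCD.
subprocess.run(["processCcd.py", repo,
                "--id", "visit=411420", "ccdnum=25",
                "--output", repo + "/rerun/prototype"], check=True)
```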

Prototype datasets:
Felipe is currently using DECam.
Simon: We should try to use a dataset where we can get calibration products since that is some of the most complicated I/O we do in the prompt pipelines.
Colin: I think we can get calibs for the HiTS data.
** Get HiTS calibration products (@ctslater?)
** Make sure obs_decam is up to the task of using the calibration products we can get. (@mrawls? )

To follow up on this, we cannot get 1 TB ephemeral storage on Nebula. I don’t think that is what you want anyway, because you would want it to be persistent to some extent. Your real request is fast disk, and on Nebula the best you’ll do to meet persistent plus fast is Cinder block storage (iSCSI, I believe). We can, however, get the core/memory flavors if they do not currently exist.

Start in Nebula with Cinder and let’s see where the bottlenecks are. I get the feeling that we are on a collision course with the container cluster management stuff we are looking into now. You aren’t the only service that has fast storage requirements (I’m looking at you, Qserv).

@jalt thanks for looking, but we do not want it to be persistent at all (@mtpatter can correct me if I’m wrong). It only needs to live as long as the instance it is attached to.

It’s true that we would like it to be local because it’s fast and will not take up bandwidth on the cluster, but we do not need a combination of fast and persistent.

We’ll certainly start with Nebula, but we’ll want to try out other things on a relatively short timescale.

Right. I know what I really want, and my real request is ephemeral storage. I do not need persistence beyond the life of a vm and prefer to be able to easily blow everything away mid-experiment. We can try with block storage for now, but as far as I am aware I cannot finely control where blocks are physically located on Nebula relative to the compute instances, which would make benchmarking results really variable. How much ephemeral storage can we get? Even a smaller amount would be useful just to compare.


I’m thinking of the scenario in which the machine your VM is on crashes and you have not yet saved the data off into the data backbone. Re-instantiating that instance will not make the data available. The scenario in which you shut down the instance intentionally is the easiest one.

You really cannot control where blocks are allocated with ephemeral storage either. You will have contention for resources in both scenarios. Variable performance is common on all shared resources, cloud or not.

We have ~136TiB of block storage. And we can add more.

I know I cannot control where an instance with ephemeral storage is allocated; wherever it is does not matter. I just want the storage to be within the instance. Can we have Nebula flavors configured with some non-negligible amount of ephemeral storage, if not 1 TiB?

It’s not that you cannot control where the instance with ephemeral storage is allocated; you cannot control the allocation of the ephemeral storage itself. I see no changes coming for ephemeral storage sizes.

Yes, understood, but that is fine. We will be running an application replicated across multiple VMs so it’s fine if the data goes away.

O.K. so there will be no flavors with any ephemeral storage at all.

Just to facilitate conversation in the future, I think the term ephemeral storage may be causing some confusion.

From the OpenStack web page:

Ephemeral storage is allocated for an instance and is deleted when the instance is deleted. The Compute service manages ephemeral storage. By default, Compute stores ephemeral drives as files on local disks on the Compute node but Ceph RBD can instead be used as the storage back end for ephemeral storage.

I think @mtpatter is assuming the default configuration for OpenStack where the ephemeral storage is always on local disk. My impression is that the Nebula machines have effectively zero local disk, so even if there was ephemeral storage it would not be local. I hope I’m getting that right.

All flavors have ephemeral disk. The quote from the OpenStack web page is accurate (of course) for default installations. In order to support live migrations, Nebula layers Gluster on top of compute storage. I think it is likely that the storage is local but cannot guarantee it.

Ephemeral disk is also shared disk, local or not. It doesn’t isolate you from noisy neighbors.

That is also very true. Thanks for the clarification.

Here’s some follow up from the actions outlined in the initial post:

I did follow up with K-T and he confirmed that there will be a backend to the butler that will be able to access the “data backbone”. I did not get a firm timeline for that development, but he indicated that it was a well known need and that it would definitely happen.

I talked with Colin and Meredith. At least some of the HiTS data is now public and is available through the NOAO Archive. We did some poking around and found that the calibrations are available as well. We did not go as far as to put together a coherent dataset, but Meredith said she would do that.

@felipe @mgelman2 Do you have notes from the meeting that we had last week? If so, could you post them here?