Cluster setup for Rubin Science Pipeline

Hello,
I am looking to set up a (Slurm) cluster of 8-10 nodes to run the Rubin Science Pipelines on DECam images. I would greatly appreciate your advice on the best way to configure the cluster. For instance, which server platform is preferred: Intel or AMD? Can I run the pipelines without problems with 6-8 GB of RAM per core? Does it make sense to have an InfiniBand intracluster network? Which Linux server distributions do you recommend (Rocky vs. Alma, or even a different one)? Do you have any additional recommendations? Thank you in advance for your help.

I’ll let others weigh in on the particular hardware, but I will strongly recommend that you use a cluster filesystem rather than regular NFS.

You can see the details of our own sizing model for systems, including pipeline processing, here: https://dmtn-135.lsst.io

I believe there have been recent discussions about running DECam data in situ through our own clusters at the USDF - just making sure you are aware of them and not duplicating effort. Generally, co-locating processing is much more efficient in terms of human effort, if it can be arranged.

4 GB/core is what we are targeting.

We are using AlmaLinux, but it doesn't really matter. Intel vs. AMD shouldn't matter either, and we are also starting to investigate ARM Linux.

Our BPS batch processing software does not talk to Slurm directly, but it can use Slurm via Parsl or HTCondor layers.
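As a rough illustration, a BPS submit file driving Slurm through the ctrl_bps_parsl plugin might look like the sketch below (submitted with `bps submit submit.yaml`). The pipeline path, site name, repo path, and resource numbers are hypothetical placeholders rather than a tested configuration, so check the ctrl_bps_parsl documentation for the exact options your version supports.

```yaml
# Sketch of a BPS submit file using the Parsl WMS plugin with a Slurm site.
# All names, paths, and numbers here are illustrative placeholders.
pipelineYaml: "${DRP_PIPE_DIR}/pipelines/DECam/DRP.yaml#step1"

wmsServiceClass: lsst.ctrl.bps.parsl.ParslService
computeSite: mycluster

site:
  mycluster:
    class: lsst.ctrl.bps.parsl.sites.Slurm
    nodes: 8                 # number of worker nodes to request
    cores_per_node: 32
    walltime: "08:00:00"

payload:
  payloadName: decam_test
  butlerConfig: /repo/decam        # hypothetical Butler repo location
  inCollection: DECam/defaults
  dataQuery: "instrument = 'DECam'"
```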

But note that certain steps in the Data Release Production will require significantly more memory per core. (We’re trying to minimize the number of such steps, but they will always exist.)
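For sizing purposes, BPS lets you set a default memory request and then raise it for specific tasks. A minimal sketch (the task label and numbers are made up; see the ctrl_bps documentation for your pipeline's actual labels):

```yaml
# Sketch: default vs. per-task memory requests in a BPS submit file (values in MB).
requestMemory: 4096        # default, matching the ~4 GB/core target above
pipetask:
  assembleCoadd:           # hypothetical label for a memory-hungry step
    requestMemory: 16384   # give such steps extra headroom
```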

On other hardware configuration questions: our processing is typically "high throughput", with independent jobs, so a very-high-performance network between worker nodes is not usually necessary. High aggregate bandwidth to storage can be desirable, however. That storage can be a shared filesystem (as Paul says, preferably one optimized for scalability) or an object store with an S3 or WebDAV interface.
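On the object-store option: my understanding is that a Butler repo on an S3-compatible store can be referenced directly by an s3:// URI, with the endpoint and credentials supplied through the environment. The bucket below is a made-up placeholder, so treat this as a sketch rather than a verified setup:

```yaml
# Sketch: pointing the BPS payload at a Butler repo on an S3-compatible store.
# Bucket name is hypothetical; endpoint/credentials come from the environment
# (e.g. S3_ENDPOINT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
payload:
  butlerConfig: "s3://my-bucket/repo/butler.yaml"
```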