Hello,
I am looking to set up a (Slurm) cluster consisting of 8-10 nodes to run the Rubin Science Pipeline on DECam images. I would greatly appreciate your advice on the best way to configure the cluster. For instance, which server platform is preferred: Intel or AMD? Can I run the pipeline without problems using 6-8 GB of RAM per core? Does it make sense to have an InfiniBand intracluster network? Which Linux server distribution do you recommend (Rocky vs. Alma, or even a different one)? Do you have any additional recommendations? Thank you in advance for your help.
I’ll let others weigh in on the particular hardware, but I strongly recommend that you use a cluster filesystem rather than regular NFS.
You can see the data from our own sizing model for systems that include pipeline processing here: https://dmtn-135.lsst.io
I believe there have been discussions recently about running DECam data in situ on our own clusters at the USDF. I mention this just to make sure you are aware of them and are not duplicating effort. Co-locating processing is generally much more efficient in human effort, if it can be arranged.
4GB/core is what we are targeting.
We are using AlmaLinux, but it doesn’t really matter. Intel vs. AMD shouldn’t matter either, and we are also starting to investigate ARM Linux.
Our BPS batch processing software does not talk to Slurm directly, but it can use Slurm via the Parsl or HTCondor layers.
But note that certain steps in the Data Release Production will require significantly more memory per core. (We’re trying to minimize the number of such steps, but they will always exist.)
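To make the memory-per-core trade-off concrete, here is a rough sketch of how many single-core jobs a node can run at once for a given memory requirement per job. The node size (64 cores, 512 GB) and the 16 GB figure for a memory-heavy step are illustrative assumptions, not numbers from the sizing model; only the 4 GB/core target comes from this thread.

```python
def usable_slots(cores: int, ram_gb: float, mem_per_job_gb: float) -> int:
    """Number of single-core jobs that fit on one node at the same time.

    The result is limited by whichever runs out first: cores or memory.
    """
    if mem_per_job_gb <= 0:
        raise ValueError("mem_per_job_gb must be positive")
    return min(cores, int(ram_gb // mem_per_job_gb))


# Hypothetical node: 64 cores with 8 GB/core (512 GB total).
NODE_CORES = 64
NODE_RAM_GB = 512

# Typical step at the 4 GB/core target: the core count is the limit.
typical = usable_slots(NODE_CORES, NODE_RAM_GB, 4)

# Assumed memory-heavy step needing 16 GB per job: memory is the limit,
# so half of the cores sit idle while such a step runs.
heavy = usable_slots(NODE_CORES, NODE_RAM_GB, 16)

print(typical, heavy)  # prints: 64 32
```

Under these assumptions, extra RAM per core does not speed up the typical steps, but it does reduce the number of idle cores during the memory-heavy ones.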
On other aspects of hardware configuration: our processing is typically “high throughput” with independent jobs, so a very-high-performance network between worker nodes is not usually necessary. High aggregate bandwidth to storage can be desirable, however. That storage can be a shared filesystem (as Paul says, preferably one optimized for scalability) or an object store with an S3 or WebDAV interface.
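A back-of-the-envelope way to think about aggregate storage bandwidth is to multiply the number of concurrent jobs by an average per-job I/O rate. The sketch below does that; the 5 MB/s per-job rate, the 64 cores per node, and the single 10 Gb/s storage link are placeholder assumptions chosen only for illustration, so substitute rates measured on your own DECam processing.

```python
def aggregate_io_gb_s(nodes: int, cores_per_node: int, per_job_mb_s: float) -> float:
    """Total storage bandwidth demand in GB/s if every core runs one job."""
    return nodes * cores_per_node * per_job_mb_s / 1000.0


# Hypothetical cluster: 10 nodes, 64 cores each, 5 MB/s average I/O per job.
demand_gb_s = aggregate_io_gb_s(nodes=10, cores_per_node=64, per_job_mb_s=5.0)

# Theoretical peak of a single 10 Gb/s link to one file server (assumed setup).
single_link_gb_s = 10 / 8

print(f"demand: {demand_gb_s:.2f} GB/s, single link: {single_link_gb_s:.2f} GB/s")
print("single server link is a bottleneck:", demand_gb_s > single_link_gb_s)
```

With these placeholder numbers the independent jobs together ask for more than one server link can deliver, which is why bandwidth to storage tends to matter more than the node-to-node interconnect.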