Can't run notebooks on the RSP for extended periods of time

Hello everyone, sorry if this has been posted before, but I can’t find anything on this. I have a notebook that needs to process a lot of data, and the process should take several days. However, after a couple of hours the RSP sort of softly shuts down the notebook: it becomes unresponsive or just stops processing things completely. I then need to restart the kernel and refresh the page to get things running again. I’ve attached a screenshot of what it looks like when it is frozen; those simple imports just won’t finish running. Additionally, there is a file save error that comes up after about 30 minutes of running the notebook, but everything seems to keep saving just fine up to a point, so I’m less concerned about that. Should I be? Is there anything I can or should do on my end to work around all of this? I’d like to just let things run without needing to monitor the tab. Thanks!

Hi

The RSP notebook service is explicitly not a bulk (aka batch) processing system, and in fact there are limits to prevent it being used as such. Services for running longer computational jobs will eventually be available, but they are not part of the Data Preview 0 services. However, you would have to run for several days before hitting those limits, so I doubt that is the case here.

If your kernel is dying after “a couple of hours”, the most likely explanation is that you are running out of memory, at which point your container is automatically killed by the system (specifically Kubernetes). Only 12 GB of RAM is available to any one user, and if you’re not careful you can exceed that.
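One rough way to check whether memory is the culprit is to print the kernel's peak memory use from inside the notebook every so often. The sketch below uses only the Python standard library. The helper names `peak_rss_gb` and `memory_fraction` are made up for illustration, and it assumes a Linux or macOS kernel. It measures only the Python process itself, not everything in the container, so treat the number as a lower bound.

```python
import resource
import sys

# Per-user RAM limit quoted in the reply above.
MEMORY_LIMIT_GB = 12.0


def peak_rss_gb():
    """Return the peak resident memory of this Python process in GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1024**3


def memory_fraction(limit_gb=MEMORY_LIMIT_GB):
    """Return peak memory use as a fraction of the given limit."""
    return peak_rss_gb() / limit_gb


print(f"peak RSS: {peak_rss_gb():.2f} GB "
      f"({memory_fraction():.0%} of {MEMORY_LIMIT_GB:g} GB)")
```

Calling this at the end of each processing step shows whether usage keeps climbing toward the limit over time.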

If you wish to do a prolonged investigation, I suggest you structure it so that the data is processed in small batches and intermediate results and output are saved as you go. That way you avoid running into limits, and you can resume where you left off if you do.
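As a minimal sketch of that batch-and-resume pattern (not an RSP-specific recipe), something like the following could work. The function names, the JSON checkpoint format, and `process_batch` are all hypothetical placeholders. It assumes the per-batch results are small enough to store as JSON; larger outputs would need to be written to separate files instead.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_index": 0, "results": []}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill mid-write leaves the old file intact."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def process_batch(batch):
    # Hypothetical stand-in for the real per-batch work.
    return sum(batch)


def run(items, checkpoint_path, batch_size=1000):
    """Process items in batches, saving progress after every batch."""
    checkpoint_path = Path(checkpoint_path)
    state = load_checkpoint(checkpoint_path)
    # Resume from the first item that has not been processed yet.
    for i in range(state["next_index"], len(items), batch_size):
        batch = items[i:i + batch_size]
        state["results"].append(process_batch(batch))
        state["next_index"] = i + len(batch)
        save_checkpoint(checkpoint_path, state)
    return state["results"]
```

If the kernel is killed partway through, calling `run` again with the same checkpoint path picks up at the first unprocessed batch rather than starting over.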

If you believe this is not the case (for example, if you find your session is terminated after a long but low-resource operation such as sleep(36000) or similar), please file a support ticket at Issues · rubin-dp0/Support · GitHub

I see, I will try to monitor this. Thank you for your response, this is very insightful.