PAWS kills active users servers that are not connected to a user session
Open, Medium, Public, Feature

Description

RileyBot had Terminal 1 and Terminal 2 actively running one task each at maxlag, at which rate the bot would complete the tasks in about 9 hours (19,000 edits)

RyanBot had Terminal 1 and Terminal 2 actively running one task each at maxlag, at which rate the bot would complete the tasks in about three weeks (145,000 edits)

Throughout yesterday and today, both users have had their servers shut off several times, stopping their running tasks. This usually happens after a task has been running for a few hours.

There would be no period of inactivity except for the scripts sleeping between edits due to maxlag.

Event Timeline

Chicocvenancio renamed this task from Actively running servers shut down unexpectedly to PAWS kills active users servers that are not connected to a user session. Mar 2 2018, 12:36 AM
Chicocvenancio triaged this task as Medium priority.
Chicocvenancio moved this task from Backlog to MVP (Most Valuable PAWS) on the PAWS board.
Chicocvenancio subscribed.

This is a known behaviour/bug in JupyterHub. Looking at one of the issues debating this, it seems the activity-tracking backbone has already been built, so it might not take a great amount of work to develop a script that uses it to define activity in a different (better) way than cull_idle.py does at the moment.

For reference:
PAWS uses a culler that will kill user servers that are not connected to a browser for more than one hour.

With the new culling behavior in upcoming 0.9 it will be possible to configure culling user servers that are disconnected from a network perspective (current behavior) or from a "server busy" perspective. Feedback on possible sane values welcome.
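To make the two notions of "idle" concrete, here is a minimal sketch. It is not JupyterHub's or PAWS's actual culler code: `ServerState`, its fields, and `is_idle` are hypothetical names invented for illustration, and the strict "longer than the timeout" comparison is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ServerState:
    # Hypothetical fields for illustration; not JupyterHub's data model.
    name: str
    last_connection: datetime  # last time a browser was connected
    last_busy: datetime        # last time a kernel or terminal was doing work


def is_idle(server: ServerState, now: datetime, timeout: timedelta,
            mode: str = "network") -> bool:
    """Return True if the server would be culled under the given idle definition."""
    if mode == "network":
        # Current behavior: only browser connectivity counts as activity.
        reference = server.last_connection
    elif mode == "busy":
        # "Server busy" perspective: running work also counts as activity.
        reference = max(server.last_connection, server.last_busy)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return now - reference > timeout


# A bot whose browser disconnected three hours ago but which edited a minute ago.
now = datetime(2018, 3, 2, 12, 0, tzinfo=timezone.utc)
bot = ServerState("bot", now - timedelta(hours=3), now - timedelta(minutes=1))
culled_by_network = is_idle(bot, now, timedelta(hours=1), mode="network")
culled_by_busy = is_idle(bot, now, timedelta(hours=1), mode="busy")
```

Under these assumptions the bot's server is culled with the network definition and kept with the busy definition, which is the difference this task is asking for.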

Chicocvenancio changed the subtype of this task from "Task" to "Feature Request". Mar 2 2019, 2:00 PM
Chicocvenancio edited projects, added PAWS; removed PAWS (JupyterHub 0.9).

When can we use it?

Toolforge is an alternative for long tasks.

I would like to give this a bump. If all of the rationales provided in https://www.mediawiki.org/wiki/PAWS#Why%3F are to be believed, then simply telling users to use Toolforge for any task that might need to run for more than an hour isn't the solution. I have made hundreds of thousands of edits via PAWS, using PWB, and it would be an incredible improvement to the user experience not to have to worry about keeping a browser session connected.

@Dominicbm I agree this is a great improvement for PAWS. Unfortunately I do not have the bandwidth to develop this for the foreseeable future and no other volunteers have stepped up.

SandraF_WMF subscribed.

Bringing this to the attention of technical scoping by WMSE, as this is a typical bug fix to a crucial tool and would have a large impact when fixed.

@SandraF_WMF I would love to help onboard more contributors to PAWS and help with technical scoping of this contribution, sorry it took me a while to notice the comment.

Would it be feasible to add some kind of 'callback/unattended' functionality for PAWS sessions, where you know a task using it will take a long time?

The thought was that you get a 'notification' when it's finished.

For various reasons, I would suggest that an unattended mode is something that you have to ask for specifically as a user right, like AWB access or bot flag.

The timeout is now 24 hours. This will be an experiment and we might need to reduce it, but the intention is to support more long running tasks in PAWS.

Is this verified to still be working, or was the limit changed back? I have seen sessions killed 2 or 3 times recently when disconnected for only a few hours.

This was removed. However, it was removed in October, not August. I've put in a PR to return the timeout to 24 hours, though I wasn't able to reproduce the results in minikube: when I set a short timeout, the job still seemed to be running after both the timeout and the ten-minute cull loop had passed.
https://github.com/toolforge/paws/pull/108
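For anyone trying to reproduce this, a rough sketch of when a kill would be expected, assuming the culler simply checks on a fixed interval and removes any server that has been idle for strictly longer than the timeout. The function name and the `offset_s` parameter are hypothetical, and the real culler's comparison and scheduling may differ.

```python
def seconds_until_culled(timeout_s: int, cull_every_s: int, offset_s: int = 0) -> int:
    """Seconds from the last activity until the first cull pass that removes the server.

    offset_s is the (assumed) delay between the last activity and the next
    cull pass, with 0 <= offset_s < cull_every_s.
    """
    if cull_every_s <= 0:
        raise ValueError("cull_every_s must be positive")
    if not 0 <= offset_s < cull_every_s:
        raise ValueError("offset_s must be in [0, cull_every_s)")
    t = offset_s
    # Step through cull passes until one sees idle time strictly above the timeout.
    while t <= timeout_s:
        t += cull_every_s
    return t


# 24-hour timeout with a ten-minute cull loop, worst case for the pass alignment.
worst_case_24h = seconds_until_culled(24 * 3600, 600, offset_s=0)
# One-hour timeout with the same loop, a pass landing one second after activity stops.
near_best_1h = seconds_until_culled(3600, 600, offset_s=1)
```

Under these assumptions a server should be gone no later than the timeout plus one cull interval, so a job still running well past that point suggests the culler is not acting on it at all.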

Based on this comment, I am confused about whether the timeout was ever increased. I recently experienced a session getting killed after only a few hours, but I don't have exact timestamps.

I never actually identified what the loop was. As I mentioned, I couldn't get the cull timeout to work in the minikube environment with a short timeout, so I'm not convinced that it was doing anything.

Regardless, I suspect what you're seeing is the result of a number of cluster upgrades over the last few days, each of which involves shutting down the servers on the old cluster. In that situation, had you restarted your server, it would actually have been running on a new k8s cluster.

T331056 and T328842 were the most recent ones (run a few hours ago and yesterday). There have been 5 k8s clusters in total since the beginning of the year. The next one will probably be a jump in k8s from v1.22 to 1.26 (7?), hopefully in a few months.