PAWS kills active users servers that are not connected to a user session
Closed, Resolved; Public; Feature

Assigned To

None

Authored By

	Riley_Huntley
	Mar 2 2018, 12:14 AM

Description

RileyBot had Terminal 1 and Terminal 2 each actively running one task at maxlag, a rate at which the bot would complete the tasks in about 9 hours (19,000 edits).

RyanBot had Terminal 1 and Terminal 2 each actively running one task at maxlag, a rate at which the bot would complete the tasks in about three weeks (145,000 edits).

Throughout yesterday and today, both users have had their servers shut off several times, stopping their running tasks. This usually happens after a task has been running for a few hours.

There would be no period of inactivity except for the scripts sleeping between edits due to maxlag.
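
As a rough check on the figures above, the average gap between edits can be worked out from the reported totals. This is only a back-of-the-envelope sketch: it assumes each edit count is the combined total for both terminals and that "three weeks" means 21 days. The function name is made up for illustration.

```python
def seconds_per_edit(total_edits: int, duration_hours: float) -> float:
    """Average wall-clock seconds between edits over the whole run."""
    return duration_hours * 3600 / total_edits


# Figures taken from the task description; durations are approximate.
rileybot_gap = seconds_per_edit(19_000, 9)          # about 9 hours
ryanbot_gap = seconds_per_edit(145_000, 21 * 24)    # about three weeks

print(f"RileyBot: ~{rileybot_gap:.1f} s per edit")
print(f"RyanBot:  ~{ryanbot_gap:.1f} s per edit")
```

Under these assumptions the pauses between edits are on the order of seconds, not hours.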

Related Objects

Mentioned In: rPAWS3e5726489d37: Merge pull request #108 from toolforge/T188684
rPAWS10dfc2b62eff: increasing timeout on cull script
T243459: Add Cron Job Functionality to PAWS (Outreachy internship)
Mentioned Here: T328842: Restructure paws away from special networking
T331056: PAWS cluster for nfs cutover

Event Timeline

Riley_Huntley created this task. Mar 2 2018, 12:14 AM

Chicocvenancio renamed this task from Actively running servers shut down unexpectedly to PAWS kills active users servers that are not connected to a user session. Mar 2 2018, 12:36 AM

Chicocvenancio triaged this task as Medium priority.

This is a known behaviour/bug in JupyterHub. Looking at one of the issues debating this, it seems the activity-tracking backbone has already been built, so it might not take a great amount of work to develop a script that uses it to define activity in a different (better) way than cull_idle.py does at the moment.

For reference:
PAWS uses a culler that will kill user servers that are not connected to a browser for more than one hour.
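
To illustrate why a busy bot can still be culled, here is a minimal sketch of timeout-based culling. It is not the actual cull script used by PAWS; the function name, server names, and the assumption that "last activity" is only refreshed while a browser is connected are all hypothetical.

```python
from datetime import datetime, timedelta


def servers_to_cull(last_activity, now, timeout=timedelta(hours=1)):
    """Return names of servers whose last recorded activity is older than `timeout`."""
    return sorted(
        name for name, seen in last_activity.items() if now - seen > timeout
    )


now = datetime(2018, 3, 2, 12, 0)
# Assumption: activity is only recorded while a browser is connected, so a
# terminal task running without a browser looks idle to the culler.
last_activity = {
    "user-with-browser-open": now - timedelta(minutes=5),
    "bot-running-without-browser": now - timedelta(hours=3),
}

print(servers_to_cull(last_activity, now))
```

In this sketch the bot's server is selected for culling even though its script is still editing, because nothing the script does updates the recorded activity time.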

Ivanhercaz subscribed. May 26 2018, 6:28 PM

Chicocvenancio edited projects, added PAWS (JupyterHub 0.9); removed PAWS. Jun 8 2018, 4:47 PM

Chicocvenancio moved this task from Backlog to Easy tasks on the PAWS (JupyterHub 0.9) board. Jun 8 2018, 4:50 PM

With the new culling behavior in upcoming 0.9 it will be possible to configure culling user servers that are disconnected from a network perspective (current behavior) or from a "server busy" perspective. Feedback on possible sane values welcome.
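
The difference between the two perspectives can be sketched as follows. This is an illustration only, not JupyterHub's configuration or API: the `ServerState` class, the `is_idle` function, and the perspective names are hypothetical, and the sketch assumes "server busy" means a kernel or terminal is currently executing something.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    connected: bool  # a browser session is attached to the server
    busy: bool       # a kernel or terminal is currently executing something


def is_idle(state: ServerState, perspective: str) -> bool:
    """Decide whether a server counts as idle under a given culling perspective."""
    if perspective == "network":
        # Current behaviour: only a connected browser counts as activity.
        return not state.connected
    if perspective == "busy":
        # Alternative: a server that is still executing work is not idle.
        return not state.busy
    raise ValueError(f"unknown perspective: {perspective!r}")


unattended_bot = ServerState(connected=False, busy=True)
print(is_idle(unattended_bot, "network"))  # True: would be culled
print(is_idle(unattended_bot, "busy"))     # False: would be kept running
```

An unattended bot run is exactly the case where the two perspectives disagree, which is why the choice of perspective matters more here than the exact timeout value.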

Chicocvenancio reopened this task as Open. Jun 9 2018, 10:31 AM

Chicocvenancio closed this task as a duplicate of T196808: PAWS terminal automatically shuts down through my account.

Chicocvenancio merged a task: T196808: PAWS terminal automatically shuts down through my account.

Chicocvenancio added subscribers: Info-farmer, jayantanth, Bodhisattwa and 2 others.

Chicocvenancio moved this task from Easy tasks to MVP (Most Valuable PAWS) on the PAWS (JupyterHub 0.9) board. Jun 21 2018, 9:22 PM

Chicocvenancio changed the subtype of this task from "Task" to "Feature Request". Mar 2 2019, 2:00 PM

Chicocvenancio edited projects, added PAWS; removed PAWS (JupyterHub 0.9).

Mahir256 subscribed. Jul 4 2019, 5:46 AM

Bugreporter merged a task: T229193: Keep the server running. Jul 29 2019, 8:02 AM

Bugreporter added subscribers: Wmr-bot, Bugreporter, wmr.

When can we use it?

In T188684#4267566, @Chicocvenancio wrote:

With the new culling behavior in upcoming 0.9 it will be possible to configure culling user servers that are disconnected from a network perspective (current behavior) or from a "server busy" perspective. Feedback on possible sane values welcome.

Toolforge is an alternative for long tasks.

I would like to give this a bump. If all of the rationales provided in https://www.mediawiki.org/wiki/PAWS#Why%3F are to be believed, then simply telling users to use Toolforge for any task that might need to run for more than an hour isn't the solution. I have made hundreds of thousands of edits via PAWS, using PWB, and it would be an incredible improvement to the user experience not to have to worry about keeping a browser session connected.

@Dominicbm I agree this is a great improvement for PAWS. Unfortunately I do not have the bandwidth to develop this for the foreseeable future and no other volunteers have stepped up.

Chicocvenancio mentioned this in T243459: Add Cron Job Functionality to PAWS (Outreachy internship). Jan 23 2020, 9:44 PM

Bringing this to the attention of technical scoping by WMSE, as this is a typical bug fix to a crucial tool, and fixing it would have a large impact.

@SandraF_WMF I would love to help onboard more contributors to PAWS and help with technical scoping of this contribution, sorry it took me a while to notice the comment.

Lokal_Profil edited projects, added WMSE-Tools-for-Partnerships-2020; removed WMSE-Tools-for-Partnerships-2019-Blueprinting. Feb 16 2021, 9:35 AM

Lokal_Profil moved this task from Backlog to To consider in technical scoping on the WMSE-Tools-for-Partnerships-2020 board.

Bugreporter merged a task: T272465: PAWS terminal (or Pywikibot) cannot seemingly run for an extended period in unattended mode. Feb 18 2021, 7:46 PM

Bugreporter added subscribers: ShakespeareFan00, JJMC89, QEDK and 2 others.

Would it be feasible to add some kind of 'callback/unattended' functionality for PAWS sessions, where you know a task using it will take a long time?

The thought was that you get a 'notification' when it's finished.

For various reasons, I would suggest that an unattended mode is something that you have to ask for specifically as a user right, like AWB access or bot flag.

The timeout is now 24 hours. This will be an experiment and we might need to reduce it, but the intention is to support more long running tasks in PAWS.

Jopparn edited projects, added WMSE-Content-partnerships-support-2021-Software-development; removed WMSE-Tools-for-Partnerships-2020. Jul 24 2021, 12:26 AM

nskaggs moved this task from MVP (Most Valuable PAWS) to Planning on the PAWS board. Aug 3 2021, 2:25 PM

In T188684#7106038, @Chicocvenancio wrote:

The timeout is now 24 hours. This will be an experiment and we might need to reduce it, but the intention is to support more long running tasks in PAWS.

Is this verified to still be working, or was the limit changed back? I have seen sessions killed 2 or 3 times recently when disconnected for only a few hours.

rook mentioned this in rPAWS10dfc2b62eff: increasing timeout on cull script. Nov 22 2021, 9:12 PM

This was removed... However, it was removed in October, not August. I've put in a PR to return the timeout to 24 hours, though I wasn't able to recreate the results in minikube: when I set a short timeout, the job still seemed to be running after the timeout and the ten-minute cull loop.
https://github.com/toolforge/paws/pull/108

Restricted Repository Identity mentioned this in rPAWS3e5726489d37: Merge pull request #108 from toolforge/T188684. Nov 23 2021, 2:16 PM

In T188684#7521655, @rook wrote:

This was removed... However, it was removed in October, not August. I've put in a PR to return the timeout to 24 hours, though I wasn't able to recreate the results in minikube: when I set a short timeout, the job still seemed to be running after the timeout and the ten-minute cull loop.
https://github.com/toolforge/paws/pull/108

Based on this comment, I am confused about whether the timeout was ever increased. I recently experienced a session getting killed after only a few hours, but I don't have exact timestamps.

In T188684#8693913, @Dominicbm wrote:

Based on this comment, I am confused about whether the timeout was ever increased. I recently experienced a session getting killed after only a few hours, but I don't have exact timestamps.

I never actually identified what the loop was. As I mentioned, I couldn't get the cull timeout working in the minikube env with a short timeout, so I'm not convinced that it was doing anything.

Regardless, I suspect what you're seeing is the result of a number of cluster upgrades over the last few days, each of which involves shutting down the servers on the old cluster. In that situation, if you restarted your server, it was actually running on a new k8s cluster.

T331056 and T328842 were the most recent ones (having run a few hours ago and yesterday). There have been 5 k8s clusters in total since the beginning of the year. The next one will probably be a jump in k8s from v1.22 to 1.26 (7?), hopefully in a few months.

Lokal_Profil removed a project: WMSE-Content-partnerships-support-2021-Software-development. Jun 3 2024, 6:16 AM

I've tried to spread out cluster rebuilds somewhat since my last comment. I haven't heard of similar issues since then, so that may well have been the cause. Please reopen if seen again.

rook closed this task as Resolved. Fri, Nov 8, 6:07 PM
