Page MenuHomePhabricator

Refill tool stuck "waiting for an available worker"
Closed, ResolvedPublic

Description

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-03-24T20:19:08Z] <bstorm> running kubectl delete pods refill-api-6c78d8cdd-d6kfq T278211

Any update on this issue ?? 3 days not working.

Sometimes does not get to the "Pending
Waiting for an available worker." message but gives a "Backend error" message.

The only thing I can do with this tool is restart the refill-api pods generally. I already did that. Normally, it recovers.
If it hasn't that's concerning as it might need real maintenance, and the tool is unmaintained to my knowledge.

It looks like it has been working and it is choking on some bad URLs:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/data/project/refill-api/refill/backend/refill/tasks.py", line 34, in fixWikipage
    ctx.applyTransforms(wikicode)
  File "/data/project/refill-api/refill/backend/refill/models/context.py", line 69, in applyTransforms
    transform.apply(self.wikicode)
  File "/data/project/refill-api/refill/backend/refill/transforms/fillexternal.py", line 34, in apply
    future.result()
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 395, in result
    return self.__get_result()
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 354, in __get_result
    raise self._exception
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/data/project/refill-api/refill/backend/refill/transforms/fillexternal.py", line 53, in _fulfill
    p.apply(citation)
  File "/data/project/refill-api/refill/backend/refill/dataparsers/download.py", line 21, in apply
    response = session.get(citation.url)
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='collaborativeprojectsarchive.wikispaces.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9782b518>: Failed to establish a new connection: [Errno -2] Name or service not known',))
[2021-03-24 20:38:35,843: WARNING/ForkPoolWorker-55] took 961.3701734542847
[2021-03-24 20:38:35,904: VERBOSE/ForkPoolWorker-55] Found 1 wikipedia:en processes running, including this one.

That traceback is very common for this tool, and not much of a sign of anything.
It also died at one point:

[2021-03-24 20:27:22,939: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
[2021-03-24 20:27:23,724: ERROR/MainProcess] Process 'ForkPoolWorker-34' pid:533 exited with 'signal 9 (SIGKILL)'

That happened several times since I restarted it, and I'm not sure it has any way to recover with the software it uses. So I'll restart it again.

Mentioned in SAL (#wikimedia-cloud) [2021-03-25T15:16:25Z] <bstorm> deleting pods to restart service T278211

Tried again and still getting the "Pending
Waiting for an available worker." message.

I don't have much ability to troubleshoot why it isn't recovering. Nobody has changed it because it is unmaintained. If someone would like to take over the tool, that would be great. As is, it is stuck on python 3.4, which is dead software anyway.

It is logging completions of tasks, so it appears to be working some of the time at least, but I don't have any real insight into why it is throwing errors or even if those task completions are legitimate in the logs.

I deleted the webservice pod as well (refill tool), but I doubt that will have much impact. I can find nothing helpful that I can do without some pretty solid knowledge of the tool and a lot of time to rewrite parts of it. I hope this isn't the death of it, because I know people use it.

I think it is significant that the tool appears to work in languages other than en. To demonstrate this

  1. go to https://refill.toolforge.org/
  2. enter Microsoft as the name of the article. I pick Microsoft as it is an article that exists in most if not all language versions.
  3. With the language dropdown set to 'en', click the 'Fix page' button and observe that the tool gets stuck at 'waiting for an available worker'.
  4. Hit the back button and repeat the test but change the language to something else, such as 'de', 'sv' or 'simple'. Observe that the tool works as expected.

So I think @Bstorm this explains why you are seeing completions in the logs.

I think the next step should be to examine the logs to see if it throws any light on why the tasks relating to 'en' are failing.

The logs only show that it has caching issues at times (redis? not sure if that's what that message relates to in the source code). Also a lot of this:

[2021-03-25 15:20:01,459: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
[2021-03-25 15:20:01,462: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/data/project/refill-api/www/python/venv/lib/python3.4/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

Since I genuinely am not committing to being a proper maintain for this tool (volunteers strongly encouraged!!), I did what I could with that information, which is mostly to investigate if it is using too much memory and getting OOM'd or something. There are OOM issues on the worker node that this is on, so what I've done is set the QoS to Guaranteed for the memory-hogging part of this so it schedules on a quiet node, and then I also bumped the stats on the uwsgi container in case that was the issue. The pod itself is fine, which makes this hard to troubleshoot. Hopefully that helps something. The logs are not terribly useful. Upgrading its libraries and python to 3.7 might help a lot (jessie containers will not be supported forever).

If it is somehow consuming more resources, I cannot keep bumping it up, though. If it now leaks resources, that needs development.

I've compared the HTTP requests generated when setting the language to en vs de and there's nothing unusual.

I've also tried running

curl "https://refill.toolforge.org/result.php" -H "authority: refill.toolforge.org" --data-raw "page=Microsoft&wiki=de&method-wiki=Fix+page&noaccessdate=on"

and

curl "https://refill.toolforge.org/result.php" -H "authority: refill.toolforge.org" --data-raw "page=Microsoft&wiki=en&method-wiki=Fix+page&noaccessdate=on"

The former does what it should, whereas the latter returns nothing

Any update on this? The backlog is increasing tremendously

Not really. The tool has no maintainer that I am aware of. I just reboot it sometimes (and am not a maintainer--don't even have commit rights on the repo). If that's not helping, I've done some quick best-effort checks and nothing is jumping out. I'll happily help someone adopt the tool if it meets the criteria here https://wikitech.wikimedia.org/wiki/Help:Toolforge/Abandoned_tool_policy

This tool meets that standard from what I can tell.

For what its worth, I've spent the whole day setting up my own dev environment to be able to debug the code. Not quite there yet. Hopefully I can identify the root cause, and come up with a fix. We would probably need to go down the abandoned tool route to get it deployed into production, though.

For what its worth, I've spent the whole day setting up my own dev environment to be able to debug the code. Not quite there yet. Hopefully I can identify the root cause, and come up with a fix. We would probably need to go down the abandoned tool route to get it deployed into production, though.

I think the repo suggests it requires Python 3.6 IIRC. The actual tool runs on Python 3.4, which makes me wonder if the code that is up there is the same as the code on the tool repo. I'll check that for you.

There are 3 commits that are not automated from translatewiki that are not present.
This is the last commit that is actually on the refill-api tool. I presume the other commits are for the frontend refill tool, so the original maintainer never deployed them? https://github.com/zhaofengli/refill/commit/115a5c42e31b65f841f30e16ffce5c81eee59036

@Bstorm yes, the repo does say there is a dependency on Python 3.6, which was the case with the version of Celery it was built with, but version 5.0.2 of Celery supports Python 3.9. If there are no breaking changes we can upgrade it at the same time as a whole load of deprecated modules.

Hi all, somehow I missed the various notifications. I'm not active on Wikipedia anymore and my knowledge of the tool has faded over the years. From the first glance at the logs posted here I'm not able to see what the root cause is, and I'm logging into Toollabs today to see what's going on.

Not really. The tool has no maintainer that I am aware of. I just reboot it sometimes (and am not a maintainer--don't even have commit rights on the repo). If that's not helping, I've done some quick best-effort checks and nothing is jumping out. I'll happily help someone adopt the tool if it meets the criteria here https://wikitech.wikimedia.org/wiki/Help:Toolforge/Abandoned_tool_policy

This tool meets that standard from what I can tell.

I do not have time or motivation to maintain the tool, and I'll be glad if someone can take over. I can transfer the tool as well as the GitHub repository to a volunteer.

@Bstorm yes, the repo does say there is a dependency on Python 3.6, which was the case with the version of Celery it was built with, but version 5.0.2 of Celery supports Python 3.9. If there are no breaking changes we can upgrade it at the same time as a whole load of deprecated modules.

It's been years since the dependencies were last updated, and it might help in this case. Let me know if you want access to the tool on Toolforge.

@zhaofengli it is great to hear from you. It would be very helpful to have a controlled handover. So yes please on access. I have created a separate Github account for the purpose: https://github.com/CurbSafeCharmer

It seems to be that this problem is only affecting Refill when the language is set to en. All other languages appear to work fine. So I believe there's been a breaking change on en.wiki. The priority is to prove that this is the case, identify what the change was, and either address the issue at the Wikipedia end or amend the Refill code to deal with whatever has changed on en.wiki.

@zhaofengli it is great to hear from you. It would be very helpful to have a controlled handover. So yes please on access. I have created a separate Github account for the purpose: https://github.com/CurbSafeCharmer

It seems to be that this problem is only affecting Refill when the language is set to en. All other languages appear to work fine. So I believe there's been a breaking change on en.wiki. The priority is to prove that this is the case, identify what the change was, and either address the issue at the Wikipedia end or amend the Refill code to deal with whatever has changed on en.wiki.

I can't seem to find your username on Toolforge. Do you have an LDAP account on Wikitech?

@zhaofengli just this moment created a Curb Safe Charmer account on wikitech.wikimedia.org.

@Curb_Safe_Charmer if you haven't previously, you'll also want to follow the various steps here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart to be able to access the tool setup in Toolforge--in case you haven't found that already.

@zhaofengli just this moment created a Curb Safe Charmer account on wikitech.wikimedia.org.

Ok, access granted. I just updated all dependencies and deployed the new version of the backend, and we'll see if the failure will occur again. I have also updated the tool to use a new queue prefix so it can start fresh without the stuck tasks.

Good work guys.... just did a run a minute ago, looks good!