Page MenuHomePhabricator

Refill tool stuck "waiting for an available worker" due to database lag
Closed, ResolvedPublic

Description

Usual problem on English Wikipedia.


[2022-06-30 12:55:04,994: INFO/ForkPoolWorker-3] Sleeping for 120.0 seconds, 2022-06-30 12:55:04
[2022-06-30 12:55:09,922: VERBOSE/ForkPoolWorker-6] Pausing due to database lag: Waiting for 10.64.0.98: 4165.835196 seconds lagged.

confirmed at https://en.wikipedia.org/w/api.php?action=query&titles=MediaWiki&format=json&maxlag=-1

{
  "error": {
    "code": "maxlag",
    "info": "Waiting for 10.64.0.98: 3467.984717 seconds lagged.",
    "host": "10.64.0.98",
    "lag": 3467.984717,
    "type": "db",
    "*": "See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
  },
  "servedby": "mw1316"
}

Event Timeline

TheresNoTime changed the task status from Open to In Progress.Jun 30 2022, 12:24 PM

Complaining of database lag?

tools.refill-api@tools-sgebastion-10:~$ kubectl logs refill-api-worker-569f9f5698-rgwfj --follow

[2022-06-30 12:33:07,474: VERBOSE/ForkPoolWorker-6] Found 9 wikipedia:en processes running, including this one.
[2022-06-30 12:33:07,600: VERBOSE/ForkPoolWorker-6] Pausing due to database lag: Waiting for 10.64.0.98: 14823.524278 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:33:07
[2022-06-30 12:33:07,600: INFO/ForkPoolWorker-6] Sleeping for 120.0 seconds, 2022-06-30 12:33:07
[2022-06-30 12:33:30,988: VERBOSE/ForkPoolWorker-4] Pausing due to database lag: Waiting for 10.64.0.98: 14682.919216 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:33:30
[2022-06-30 12:33:30,989: INFO/ForkPoolWorker-4] Sleeping for 120.0 seconds, 2022-06-30 12:33:30
[2022-06-30 12:33:39,767: VERBOSE/ForkPoolWorker-5] Pausing due to database lag: Waiting for 10.64.0.98: 14631.618491 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:33:39
[2022-06-30 12:33:39,768: INFO/ForkPoolWorker-5] Sleeping for 120.0 seconds, 2022-06-30 12:33:39
[2022-06-30 12:33:44,126: VERBOSE/ForkPoolWorker-1] Pausing due to database lag: Waiting for 10.64.0.98: 14607.055482 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:33:44
[2022-06-30 12:33:44,126: INFO/ForkPoolWorker-1] Sleeping for 120.0 seconds, 2022-06-30 12:33:44

yet https://replag.toolforge.org doesn't show any... hmm..

Ah I'm looking in the wrong place — https://en.wikipedia.org/w/api.php?action=query&titles=MediaWiki&format=json&maxlag=-1 shows a lag:

{
  "error": {
    "code": "maxlag",
    "info": "Waiting for 10.64.0.98: 5922.806938 seconds lagged.",
    "host": "10.64.0.98",
    "lag": 5922.806938,
    "type": "db",
    "*": "See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
  },
  "servedby": "mw1341"
}

[2022-06-30 12:49:09,306: VERBOSE/ForkPoolWorker-6] Pausing due to database lag: Waiting for 10.64.0.98: 7519.173907 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:49:09
[2022-06-30 12:49:09,308: INFO/ForkPoolWorker-6] Sleeping for 120.0 seconds, 2022-06-30 12:49:09
[2022-06-30 12:49:32,520: VERBOSE/ForkPoolWorker-4] Found 8 wikipedia:en processes running, including this one.
[2022-06-30 12:49:32,629: VERBOSE/ForkPoolWorker-4] Pausing due to database lag: Waiting for 10.64.0.98: 7332.522778 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:49:32
[2022-06-30 12:49:32,629: INFO/ForkPoolWorker-4] Sleeping for 120.0 seconds, 2022-06-30 12:49:32
[2022-06-30 12:49:41,404: VERBOSE/ForkPoolWorker-5] Found 8 wikipedia:en processes running, including this one.
[2022-06-30 12:49:41,507: VERBOSE/ForkPoolWorker-5] Pausing due to database lag: Waiting for 10.64.0.98: 7262.435271 seconds lagged.
Sleeping for 120.0 seconds, 2022-06-30 12:49:41
[2022-06-30 12:49:41,508: INFO/ForkPoolWorker-5] Sleeping for 120.0 seconds, 2022-06-30 12:49:41

lag seems to be decreasing, so will monitor

TheresNoTime lowered the priority of this task from High to Medium.Jun 30 2022, 12:54 PM

Dropping priority — doesn't initially appear to be an error with refill, though will continue to monitor 🙂

TheresNoTime renamed this task from Refill tool stuck "waiting for an available worker" to Refill tool stuck "waiting for an available worker" due to database lag.Jun 30 2022, 12:56 PM
TheresNoTime updated the task description. (Show Details)

Database lag recovered (0.145272 seconds lagged), reFill now appears to be working as expected (test edit) and a follow-up task to T311713: Show a warning to the user if refill is throttling due to database lag has been created — marking resolved.

@Keith_D thank you again for the report, could you confirm everything is working for you?

Hello, thanks for looking at this, all appears to be working again.