MediaWiki periodic job failures due to timeouts
Open, In Progress, High, Public

Assigned To
None
Authored By
jijiki
Apr 7 2026, 12:28 PM
Referenced Files
F75319581: image.png
Apr 8 2026, 11:43 AM
F75318592: image.png
Apr 8 2026, 11:32 AM
F75259445: image.png
Apr 7 2026, 7:12 PM
F75237448: image.png
Apr 7 2026, 12:51 PM
F75237382: image.png
Apr 7 2026, 12:51 PM
F75236659: image.png
Apr 7 2026, 12:38 PM
F75236641: image.png
Apr 7 2026, 12:38 PM
F75236627: image.png
Apr 7 2026, 12:38 PM

Description

Lately we have been having quite a few cronjob failures, with @phaultfinder opening various tasks.

On March 25th @ 15:00 we switched all jobs to eqiad, as part of T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad)
On April 2nd @ 13:17 we pooled codfw back for reads

I updated the mw-cron (MediaWiki Periodic Jobs on k8s) dashboard in an effort to extract more information. Because of multiline logs, the MediaWiki Maintenance Jobs - k8s search returns a lot of results, which makes filtering a bit difficult.

Culprit #1
Related to T422455: Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th.

{
  "query": {
    "regexp": {
      "log.keyword": ".*curl error: 28.*Timeout was reached.*"
    }
  }
}
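
To compare a window before the switchover with one after it, the regexp clause above can be wrapped in a `bool` query with a time-range filter. The following is a minimal sketch that builds such a query body in Python; the `@timestamp` field name and the helper name `timeout_query` are assumptions for illustration, not taken from the dashboard.

```python
import json

# Pattern copied verbatim from the query above.
TIMEOUT_PATTERN = ".*curl error: 28.*Timeout was reached.*"


def timeout_query(start, end, timestamp_field="@timestamp"):
    """Build a query body matching the timeout pattern within [start, end).

    `timestamp_field` is an assumed field name; adjust it to the index mapping.
    """
    return {
        "query": {
            "bool": {
                "must": [{"regexp": {"log.keyword": TIMEOUT_PATTERN}}],
                "filter": [
                    {"range": {timestamp_field: {"gte": start, "lt": end}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    # Example window only; pick whichever range is being compared.
    body = timeout_query("2026-03-15T00:00:00Z", "2026-03-22T00:00:00Z")
    print(json.dumps(body, indent=2))
```

Running the same query body over equal-length windows keeps the before/after counts comparable.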

Culprit #2
Additionally, some jobs seem to fail because they are unable to talk to the DB.

image.png (212×2 px, 76 KB)

{
  "query": {
    "regexp": {
      "log.keyword": ".*Error.+2006.+MySQL server has gone away.*"
    }
  }
}
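
For triaging exported log entries offline, the two regexp queries above can be approximated with Python's `re` module. This is only a sketch: the culprit names and the sample log lines are made up for illustration, and the patterns are a translation rather than the exact search engine semantics.

```python
import re

# Python translations of the two regexp queries above. Lucene regexps are
# anchored to the whole field value, so the leading/trailing ".*" is dropped
# here and re.search is used instead. DOTALL lets "." cross line breaks in
# multiline log entries.
CULPRITS = {
    "etcd_timeout": re.compile(r"curl error: 28.*Timeout was reached", re.DOTALL),
    "db_gone_away": re.compile(r"Error.+2006.+MySQL server has gone away", re.DOTALL),
}


def classify(log_entry):
    """Return the names of all culprit patterns that match the entry."""
    return [name for name, pattern in CULPRITS.items() if pattern.search(log_entry)]


if __name__ == "__main__":
    # Hypothetical sample entries, not real log output.
    samples = [
        "fetch failed (curl error: 28) Timeout was reached",
        "Error 2006: MySQL server has gone away",
        "job finished successfully",
    ]
    for entry in samples:
        print(classify(entry), entry)
```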

Event Timeline

I filtered timeouts from the mediamoderation-hourlyscan job in an attempt to establish whether we are seeing those timeouts more often after switching to eqiad.

Please note that the graphs below do not necessarily show failed jobs, though it is worth connecting those dots as well.

  • March 8th - March 15th

image.png (942×2 px, 187 KB)

  • March 15th - March 22nd

image.png (900×2 px, 204 KB)

  • March 23rd - March 29th

image.png (1×2 px, 236 KB)

  • March 29th - April 4th

image.png (946×3 px, 238 KB)

  • April 4th - April 7th (3 days)

image.png (958×2 px, 213 KB)
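
Since the windows above differ in length, comparing per-day rates is less misleading than comparing raw counts. Below is a minimal sketch of that comparison around the switchover; the function names are hypothetical, the switchover time is assumed to be UTC, and the timestamps would come from an export of the matching log entries.

```python
from datetime import datetime, timezone

# Switchover time from the task description; UTC is an assumption.
SWITCHOVER = datetime(2026, 3, 25, 15, 0, tzinfo=timezone.utc)


def rate_per_day(timestamps, start, end):
    """Events per day among `timestamps` that fall within [start, end)."""
    days = (end - start).total_seconds() / 86400
    hits = sum(1 for ts in timestamps if start <= ts < end)
    return hits / days


def compare_rates(timestamps, start, end, pivot=SWITCHOVER):
    """Return (rate before pivot, rate after pivot) in events per day."""
    return (
        rate_per_day(timestamps, start, pivot),
        rate_per_day(timestamps, pivot, end),
    )
```

A clearly higher second value would support the idea that the timeouts became more frequent after the move to eqiad, though it would not by itself show that more jobs failed.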

jijiki changed the task status from Open to In Progress. Apr 7 2026, 12:45 PM
jijiki triaged this task as High priority.
jijiki updated the task description.
jijiki renamed this task from MediaWiki periodic job failures to MediaWiki periodic job failures due to timeouts. Apr 7 2026, 1:09 PM

Things look quite good so far on the mw-cron (MediaWiki Periodic Jobs on k8s) dashboard after merging 1268569. More details in T422455#11795500

image.png (754×1 px, 104 KB)

I can't help noticing that the MediaWiki periodic job update-special-pages-s5 failed twice for the same reason, which is either a very unfortunate coincidence related to T422489: rdbms errors in eqiad, or something worth investigating.

image.png (328×2 px, 93 KB)