
conf* hosts ran out of disk space due to log spam
Closed, Resolved (Public)

Description

Several conf* hosts ran out of disk space due to too many actions taken from snapshot1013. This is believed to be caused by a bug in the MediaWiki dumping scripts, which query and reload the state of the database configuration too often (once per row?): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php

This caused lvs hosts to complain about not being able to contact etcd:

[18:43] <icinga-wm> PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 2 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[18:43] <icinga-wm> PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 37 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal
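
For illustration only: the suspected anti-pattern and the shape of the fix look roughly like the sketch below. This is not the actual WikiExporter.php code (which is PHP); all names here are hypothetical stand-ins and the batch size is an arbitrary assumption.

```
# Hypothetical sketch of the suspected bug and of the shape of the fix; this is
# NOT the actual MediaWiki/WikiExporter code. All names and numbers are made up.

class ConfigStore:
    """Stand-in for the etcd-backed database configuration."""

    def __init__(self):
        self.reload_count = 0

    def reload(self):
        # In production this would be a round-trip to etcd (the conf* hosts);
        # here we only count how often it is called.
        self.reload_count += 1


def export_rows_buggy(rows, config):
    """Reload the configuration once per exported row (the suspected bug)."""
    for _row in rows:
        config.reload()          # one etcd hit per row -> request/log spam
        # ... write the row to the dump output ...


def export_rows_fixed(rows, config, batch_size=1000):
    """Reload the configuration at most once per batch of rows."""
    for i, _row in enumerate(rows):
        if i % batch_size == 0:
            config.reload()      # one etcd hit per batch
        # ... write the row to the dump output ...


if __name__ == "__main__":
    rows = range(100_000)
    buggy, fixed = ConfigStore(), ConfigStore()
    export_rows_buggy(rows, buggy)
    export_rows_fixed(rows, fixed)
    print(buggy.reload_count, fixed.reload_count)   # 100000 vs. 100
```

A dump walks through a very large number of rows, so the difference between reloading per row and per batch is the difference between hammering the conf*/etcd hosts and barely touching them.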

Pending things:

  • Patch logic so etcd config reloads do not happen so aggressively
  • Restart dump process
  • Fix pending dbctl commits
  • Restart db maintenance
  • Something else?

Event Timeline

jcrespo renamed this task from "conf* host ran out of disk space due to log spam" to "conf* hosts ran out of disk space due to log spam". Nov 3 2022, 6:20 PM
jcrespo updated the task description.
jcrespo updated the task description.

Change 852990 had a related patch set uploaded (by BBlack; author: Amir Sarabadani):

[mediawiki/core@master] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852990

Change 852883 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.40.0-wmf.8] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852883

Change 852990 merged by jenkins-bot:

[mediawiki/core@master] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852990

Change 852884 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_39] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852884

Change 852883 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.8] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852883

Mentioned in SAL (#wikimedia-operations) [2022-11-03T18:59:29Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]]

Mentioned in SAL (#wikimedia-operations) [2022-11-03T18:59:48Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-11-03T19:03:54Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] (duration: 04m 24s)

Not sure if it is in scope for this task, but we should add monitoring on a metric like https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&from=now-7d&to=now&viewPanel=4 to catch issues like these as early as possible.


+1, although I would add it as a separate task to avoid forgetting it, more under "ways to prevent something similar from happening again / improve monitoring", keeping this task just for the immediate actionables.
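
As a rough illustration of that suggestion (and not the Icinga/Alertmanager check that would actually be deployed), a minimal threshold check against a Prometheus-style metric could look like the sketch below; the endpoint, metric name, and threshold are assumptions, not the real WMF configuration.

```
# Hypothetical sketch of a threshold check on an etcd request-rate metric, as a
# stand-in for a real Icinga/Alertmanager alert. URL, query and threshold are
# assumptions, not real WMF configuration.
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.org/api/v1/query"  # assumption
QUERY = "sum(rate(etcd_http_received_total[5m]))"              # assumed metric
THRESHOLD = 500.0                                              # assumed req/s ceiling


def current_value(url: str, query: str) -> float:
    """Return the value of an instant query, or 0.0 if the result is empty."""
    with urllib.request.urlopen(f"{url}?query={urllib.parse.quote(query)}") as resp:
        data = json.load(resp)
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    value = current_value(PROMETHEUS_URL, QUERY)
    if value > THRESHOLD:
        print(f"CRITICAL: etcd request rate {value:.0f}/s exceeds {THRESHOLD:.0f}/s")
        sys.exit(2)  # Nagios/Icinga-style "critical" exit code
    print(f"OK: etcd request rate {value:.0f}/s")
    sys.exit(0)
```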

Adding @daniel, as I believe this was the problematic patch, 5b0b54599bfd, but I am not 100% sure, because I would have guessed it would have failed at the beginning of October, with last month's dumps? Or was it feature-flag-disabled?

Also, in case a more fine-grained fix is needed (e.g. to reload the config at other points of the export process).

After the patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 was backported and deployed, dumps resumed running about an hour later (there is a script that checks whether they are running and, if not, restarts them; it does this twice a day during the run period). As of this morning the stubs jobs seem to be completing normally. etcd looks good on the graphs too, as Amir noted, see below:

(Screenshot of the etcd Grafana dashboard: Στιγμιότυπο από 2022-11-04 12-41-54.png, 60 KB)

jcrespo added subscribers: Marostegui, Ladsgroup.

DB maintenance is back to normal/no longer affected, as far as I understood from @Marostegui and @Ladsgroup.

With this, further tuning of the config reload for dumps (if needed) should probably happen on T298485 or a follow-up. CC @daniel
Follow-up caused by the dump disruption (if needed) should happen on T322363. CC @ArielGlenn
I've created a new ticket to follow up on the monitoring gap: T322400. CC serviceops / observability

Change 852884 merged by jenkins-bot:

[mediawiki/core@REL1_39] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852884

jijiki claimed this task.
jijiki subscribed.

This task itself looks like it is done; please reopen if you disagree or if I am missing something :)