
conf* hosts ran out of disk space due to log spam
Closed, Resolved (Public)

Description

Several conf* hosts ran out of disk space due to too many actions taken from snapshot1013. This is believed to be caused by a bug in the MediaWiki dumping scripts, which query and reload the state of the database configuration too often (once per row?): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/798678/13/includes/export/WikiExporter.php

This caused lvs hosts to complain about not being able to contact etcd:

[18:43] <icinga-wm> PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 2 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[18:43] <icinga-wm> PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 37 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal
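
For illustration only: the suspected anti-pattern and the shape of the fix look roughly like the sketch below. This is not the actual WikiExporter.php code (which is PHP); all names here are hypothetical stand-ins and the batch size is an arbitrary assumption.

```
# Hypothetical sketch of the suspected bug and of the shape of the fix; this is
# NOT the actual MediaWiki/WikiExporter code. All names and numbers are made up.

class ConfigStore:
    """Stand-in for the etcd-backed database configuration."""

    def __init__(self):
        self.reload_count = 0

    def reload(self):
        # In production this would be a round-trip to etcd (the conf* hosts);
        # here we only count how often it is called.
        self.reload_count += 1


def export_rows_buggy(rows, config):
    """Reload the configuration once per exported row (the suspected bug)."""
    for _row in rows:
        config.reload()          # one etcd hit per row -> request/log spam
        # ... write the row to the dump output ...


def export_rows_fixed(rows, config, batch_size=1000):
    """Reload the configuration at most once per batch of rows."""
    for i, _row in enumerate(rows):
        if i % batch_size == 0:
            config.reload()      # one etcd hit per batch
        # ... write the row to the dump output ...


if __name__ == "__main__":
    rows = range(100_000)
    buggy, fixed = ConfigStore(), ConfigStore()
    export_rows_buggy(rows, buggy)
    export_rows_fixed(rows, fixed)
    print(buggy.reload_count, fixed.reload_count)   # 100000 vs. 100
```

A dump walks through a very large number of rows, so the difference between reloading per row and per batch is the difference between hammering the conf*/etcd hosts and barely touching them.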

Pending things:

  • Patch logic so etcd config reloads do not happen so aggressively
  • Restart dump process
  • Fix pending dbctl commits
  • Restart db maintenance
  • Something else?

Event Timeline

jcrespo renamed this task from "conf* host ran out of disk space due to log spam" to "conf* hosts ran out of disk space due to log spam". Nov 3 2022, 6:20 PM
jcrespo updated the task description.
jcrespo updated the task description.

Change 852990 had a related patch set uploaded (by BBlack; author: Amir Sarabadani):

[mediawiki/core@master] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852990

Change 852883 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.40.0-wmf.8] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852883

Change 852990 merged by jenkins-bot:

[mediawiki/core@master] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852990

Change 852884 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_39] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852884

Change 852883 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.8] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852883

Mentioned in SAL (#wikimedia-operations) [2022-11-03T18:59:29Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]]

Mentioned in SAL (#wikimedia-operations) [2022-11-03T18:59:48Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-11-03T19:03:54Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] (duration: 04m 24s)

Not sure if it is in scope for this task, but we should add monitoring on a metric like https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&from=now-7d&to=now&viewPanel=4 to catch issues like these as early as possible.


+1, although I would add it as a separate task to avoid forgetting it, more under "ways to prevent something similar from happening again / improve monitoring", keeping this task just for the immediate actionables.
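
As a rough illustration of that suggestion (and not the Icinga/Alertmanager check that would actually be deployed), a minimal threshold check against a Prometheus-style metric could look like the sketch below; the endpoint, metric name, and threshold are assumptions, not the real WMF configuration.

```
# Hypothetical sketch of a threshold check on an etcd request-rate metric, as a
# stand-in for a real Icinga/Alertmanager alert. URL, query and threshold are
# assumptions, not real WMF configuration.
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.org/api/v1/query"  # assumption
QUERY = "sum(rate(etcd_http_received_total[5m]))"              # assumed metric
THRESHOLD = 500.0                                              # assumed req/s ceiling


def current_value(url: str, query: str) -> float:
    """Return the value of an instant query, or 0.0 if the result is empty."""
    with urllib.request.urlopen(f"{url}?query={urllib.parse.quote(query)}") as resp:
        data = json.load(resp)
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    value = current_value(PROMETHEUS_URL, QUERY)
    if value > THRESHOLD:
        print(f"CRITICAL: etcd request rate {value:.0f}/s exceeds {THRESHOLD:.0f}/s")
        sys.exit(2)  # Nagios/Icinga-style "critical" exit code
    print(f"OK: etcd request rate {value:.0f}/s")
    sys.exit(0)
```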

Adding @daniel, as I believe this was the problematic patch, 5b0b54599bfd, but I am not 100% sure, because I would have guessed it would have failed at the beginning of October, with last month's dumps? Or was it feature-flag-disabled?

Also, in case a more fine-grained fix is needed (e.g. to reload the config at other points of the export process).

After the patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 was backported and deployed, dumps resumed running about an hour later (there is a script that checks whether they are running and, if not, restarts them; it does this twice a day during the run period). As of this morning the stubs jobs seem to be completing normally. etcd looks good on the graphs too, as Amir noted, see below:

(Screenshot of the etcd Grafana dashboard: Στιγμιότυπο από 2022-11-04 12-41-54.png, 60 KB)

jcrespo added subscribers: Marostegui, Ladsgroup.

DB maintenance is back to normal/no longer affected, as far as I understood from @Marostegui and @Ladsgroup.

With this, further tuning of the config reload for dumps (if needed) should probably happen on T298485 or a follow-up. CC @daniel
Follow-up caused by the dump disruption (if needed) should happen on T322363. CC @ArielGlenn
I've created a new ticket to follow up on the monitoring gap: T322400. CC serviceops / observability

Change 852884 merged by jenkins-bot:

[mediawiki/core@REL1_39] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852884

jijiki claimed this task.
jijiki subscribed.

This task itself looks like it is done; please reopen if you disagree or if I am missing something :)