ToolsDB unstable following unplanned software upgrade
Closed, Resolved, Public

Assigned To

None

Authored By

	bd808
	Oct 24 2019, 6:17 PM

Description

A planned reboot on 2019-10-24, related to T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC), unexpectedly resulted in a different MariaDB package being installed on the instance acting as the primary server for ToolsDB. This version of MariaDB appears to be highly unstable under our "normal" workloads.

Related Objects

Status	Assigned	Task
Resolved	None	T236420 ToolsDB unstable following unplanned software upgrade
Resolved	aborrero	T236384 Toolsdb: prevent unattended-upgrades from upgrading mariadb
Declined	None	T236399 Upgrade mariadb on toolsdb servers to 10.1.44
Resolved	bd808	T236423 Restore db write functionality for https://tools.wmflabs.org/deadlinks/api/

Event Timeline

bd808 created this task. Oct 24 2019, 6:17 PM

Restricted Application added a subscriber: Aklapper. Oct 24 2019, 6:17 PM

bd808 triaged this task as High priority. Oct 24 2019, 6:17 PM

bd808 added a subtask: T236384: Toolsdb: prevent unattended-upgrades from upgrading mariadb.

In T236384#5603980, @Stashbot wrote:

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T18:10:56Z] <bd808> bd808 hacked public_html/api/index.php to stop all db writes. Trying to slow toolsdb high volume writes while we work on fixing a bad software update (see also T236384)

bd808 added a subtask: T236399: Upgrade mariadb on toolsdb servers to 10.1.44. Oct 24 2019, 6:22 PM

bd808 updated the task description. Oct 24 2019, 6:26 PM

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T18:34:51Z] <bstorm_> downgraded clouddb1001 to 10.1.39 T236420 T236384

Stashbot mentioned this in T236384: Toolsdb: prevent unattended-upgrades from upgrading mariadb. Oct 24 2019, 6:34 PM

At this point, we have some aggressive query killing going on as well. The downgrade might be a better bet.

Hardware seems clean.

It seems stable at this point with query killing off. The downgrade seems to be the ticket.

In T236420#5604802, @Bstorm wrote:

It seems stable at this point with query killing off. The downgrade seems to be the ticket.

Should we close T236399: Upgrade mariadb on toolsdb servers to 10.1.44 as WONTFIX and fix the pinning instead?

Perhaps we should at least drop it to low priority? We probably should use 10.1.42 after testing in prod.

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T08:13:02Z] <arturo> enable puppet in clouddb1001/clouddb1002 to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/546102 (T236384 , T236420)

It seems the secondary is using the wrong mariadb version:

aborrero@clouddb1002:~$ apt-cache policy wmf-mariadb101
wmf-mariadb101:
  Installed: 10.1.41-1
  Candidate: 10.1.41-1
  Version table:
 *** 10.1.41-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

aborrero@clouddb1001:~$ apt-cache policy wmf-mariadb101
wmf-mariadb101:
  Installed: 10.1.39-1
  Candidate: 10.1.39-1
  Version table:
     10.1.41-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
 *** 10.1.39-1 1002
        100 /var/lib/dpkg/status

I guess this is something we would like to fix. Not doing this myself until I get more people involved.
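For anyone who wants to check this kind of drift across hosts without eyeballing the output, here is a minimal sketch in Python. It assumes the `apt-cache policy` output has already been collected per host as text; the helper names `parse_policy` and `mismatched_hosts` are hypothetical, and the sample strings are trimmed copies of the output pasted above.

```python
import re

# Trimmed copies of the `apt-cache policy wmf-mariadb101` output from this task.
CLOUDDB1002 = """wmf-mariadb101:
  Installed: 10.1.41-1
  Candidate: 10.1.41-1
"""

CLOUDDB1001 = """wmf-mariadb101:
  Installed: 10.1.39-1
  Candidate: 10.1.39-1
"""


def parse_policy(text):
    """Return (installed, candidate) parsed from `apt-cache policy <pkg>` output.

    Either element is None when the corresponding line is missing.
    """
    installed = re.search(r"^\s*Installed:\s*(\S+)", text, re.MULTILINE)
    candidate = re.search(r"^\s*Candidate:\s*(\S+)", text, re.MULTILINE)
    return (
        installed.group(1) if installed else None,
        candidate.group(1) if candidate else None,
    )


def mismatched_hosts(policies):
    """Return {host: installed_version} if the hosts disagree, else {}.

    `policies` maps a host name to the raw `apt-cache policy` output
    captured on that host (how it is captured is left to the caller).
    """
    installed = {host: parse_policy(out)[0] for host, out in policies.items()}
    return installed if len(set(installed.values())) > 1 else {}
```

Calling `mismatched_hosts({"clouddb1001": CLOUDDB1001, "clouddb1002": CLOUDDB1002})` flags both hosts with their differing installed versions. Note that this only compares package versions, not the version of the server process that is actually running.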

Interestingly enough, grafana reports that clouddb1002 is running 10.1.38, but 10.1.41 is installed, as reported in the previous comment. There is no mention of 10.1.39 on clouddb1002.

I wonder if what happened here is we upgraded 2 versions without notice: 10.1.38 to 10.1.39 to 10.1.41 and after the reboot we moved directly from .38 to .41.

Yes, that is "expected": the package version and the running version can differ because upgrading wmf-mariadb101 doesn't stop the server while it is running, and the server should keep working well for the most part, even if the package is uninstalled. That it "works" doesn't mean that it is desirable.
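To make the package-versus-running distinction concrete, here is a small sketch of the comparison in Python. It is an illustration only: it assumes the package version comes from dpkg (for example `10.1.41-1`) and the running version comes from the server itself (for example the string `SELECT VERSION()` returns, such as `10.1.38-MariaDB`). The function names are hypothetical, and Debian epochs (`1:10.1.41-1`) are not handled.

```python
import re


def upstream_version(version):
    """Extract the leading dotted numeric part of a version string as a tuple.

    "10.1.41-1" and "10.1.41-MariaDB" both yield (10, 1, 41).
    Raises ValueError if the string does not start with a number.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_restart(package_version, running_version):
    """True when the installed package and the running server differ.

    A mismatch means the next restart (planned or not) will switch the
    server to the installed package's version.
    """
    return upstream_version(package_version) != upstream_version(running_version)
```

With the numbers from this task, `needs_restart("10.1.41-1", "10.1.38-MariaDB")` is true for clouddb1002 before the downgrade, which is the situation that turned a routine reboot into a version jump.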

I have reported the crashes at https://jira.mariadb.org/browse/MDEV-20323 . I have saved the syslog to clouddb1001:/home/jynus/syslog.1 in case we need it later for further debugging.

jcrespo added a project: Upstream. Oct 25 2019, 8:43 AM

Interestingly, we have labsdb1011 running on 10.1.41, but it doesn't have concurrent writes (it is read only, with only replication).

There seem to be multiple crash bugs reported on JIRA for 10.1 -> 10.3 and 10.1 -> 10.1 upgrades, but I saw no bug fix in the 10.1.42 changelog matching our description.

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:37:10Z] <arturo> clouddb1002 downgrading wmf-mariadb101 from 10.1.41-1 to 10.1.39-1 (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:45:22Z] <arturo> icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:45:29Z] <arturo> icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:46:36Z] <arturo> (jynus) clouddb1002 mariadb (toolsdb secondary) being upgraded from 10.1.38 to 10.1.39 (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:49:26Z] <arturo> (jynus) clouddb1002 mariadb (toolsdb secondary) being upgraded from 10.1.38 to 10.1.39 is done !(T236384 , T236420)

Lowering priority. Things are looking "normal" following the downgrade back to a known stable MariaDB version.

bd808 closed subtask T236423: Restore db write functionality for https://tools.wmflabs.org/deadlinks/api/ as Resolved. Oct 28 2019, 4:10 PM

bd808 moved this task from Backlog to ToolsDB on the Data-Services board. Oct 31 2019, 1:09 AM

aborrero closed subtask T236384: Toolsdb: prevent unattended-upgrades from upgrading mariadb as Resolved. Nov 4 2019, 11:58 AM

jcrespo mentioned this in Unknown Object (Task). Nov 6 2019, 9:56 AM

There is a new 10.1.43 version which we believe might have fixed this issue. Would it be possible to upgrade the tools host that crashed to this new version?
This would also help us evaluate whether we should upgrade our s1 (enwiki) master to 10.1.43, as we are planning to fail it over next Thursday.

Ladsgroup subscribed. Mar 2 2022, 5:14 AM

Closing this years-old ticket. The current plan is to upgrade to MariaDB 10.4.

	F30879585: image.png
	Oct 25 2019, 8:25 AM
