ToolsDB unstable following unplanned software upgrade
Closed, Resolved (Public)

Description

A planned reboot on 2019-10-24, related to T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC), unexpectedly resulted in the installation of a different MariaDB package on the instance acting as the primary server for ToolsDB. This version of MariaDB appears to be highly unstable under our "normal" workloads.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T18:10:56Z] <bd808> bd808 hacked public_html/api/index.php to stop all db writes. Trying to slow toolsdb high volume writes while we work on fixing a bad software update (see also T236384)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T18:34:51Z] <bstorm_> downgraded clouddb1001 to 10.1.39 T236420 T236384

At this point, we have some aggressive query killing going on as well. The downgrade might be a better bet.

It seems stable at this point with query killing off. The downgrade seems to be the ticket.

Should we close T236399: Upgrade mariadb on toolsdb servers to 10.1.44 as WONTFIX and fix the pinning instead?

Perhaps we should at least drop it to low priority? We probably should use 10.1.42 after testing in prod.
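For reference, the pinning mentioned above could be expressed as an APT preferences stanza. This is only a sketch: the helper name `pin_stanza` and the default priority are assumptions for illustration, not what is deployed on the ToolsDB hosts. A pin priority above 1000 lets APT keep (or downgrade to) the pinned version even when a newer candidate exists in the repository.

```shell
#!/bin/sh
# Hypothetical helper: print an APT preferences stanza that pins a package
# to one exact version.
# $1: package name, $2: version, $3: pin priority (optional, defaults to 1001)
pin_stanza() {
    printf 'Package: %s\nPin: version %s\nPin-Priority: %s\n' "$1" "$2" "${3:-1001}"
}

# Example (writing the file requires root; the file name is an assumption):
#   pin_stanza wmf-mariadb101 10.1.39-1 1002 > /etc/apt/preferences.d/wmf-mariadb101.pref
```

After writing such a file, `apt-cache policy wmf-mariadb101` should list the pinned version as the candidate.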

It seems the secondary is using the wrong mariadb version:

aborrero@clouddb1002:~$ apt-cache policy wmf-mariadb101
wmf-mariadb101:
  Installed: 10.1.41-1
  Candidate: 10.1.41-1
  Version table:
 *** 10.1.41-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

vs

aborrero@clouddb1001:~$ apt-cache policy wmf-mariadb101
wmf-mariadb101:
  Installed: 10.1.39-1
  Candidate: 10.1.39-1
  Version table:
     10.1.41-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
 *** 10.1.39-1 1002
        100 /var/lib/dpkg/status

I guess this is something we would like to fix. Not doing this myself until I get more people involved.
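The comparison above can be scripted instead of read by eye. A minimal sketch, assuming the `apt-cache policy` output format shown in the two sessions; the function name `policy_field` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache policy <pkg>` output on stdin and print
# the value of one field, e.g. "Installed" or "Candidate".
policy_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Example, run on each host and compared afterwards:
#   apt-cache policy wmf-mariadb101 | policy_field Installed
#   apt-cache policy wmf-mariadb101 | policy_field Candidate
```

If the two hosts print different `Installed` values, the primary and the secondary are on different package versions.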

Interestingly enough, grafana reports that clouddb1002 is running 10.1.38, but 10.1.41 is installed, as reported in the previous comment. There is no mention of 10.1.39 on clouddb1002.

image.png (591×2 px, 92 KB)

I wonder if what happened here is that the package was upgraded twice without anyone noticing (10.1.38 to 10.1.39 to 10.1.41), so that after the reboot the running server moved directly from .38 to .41.

Yes, that is "expected": the installed package version and the running version can differ, because upgrading wmf-mariadb101 doesn't stop the server while it is running, and the running server should keep working well for the most part, even if the package is uninstalled. That it "works" doesn't mean that it is desirable.
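A drift between the installed package and the running server, as described above, can be detected by comparing the two version strings. The sketch below is an assumption-laden illustration: the function names are invented here, and it assumes the package version looks like `10.1.41-1` while the server reports something like `10.1.38-MariaDB`.

```shell
#!/bin/sh
# Hypothetical helper: strip everything from the first "-" onwards,
# e.g. "10.1.41-1" -> "10.1.41" and "10.1.38-MariaDB" -> "10.1.38".
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Succeeds (exit status 0) when the installed package and the running server
# disagree, i.e. the next restart would change the running version.
# $1: installed package version, $2: version reported by the running server
version_drift() {
    [ "$(upstream_version "$1")" != "$(upstream_version "$2")" ]
}

# Example (assumes local access to dpkg and to the running server):
#   pkg=$(dpkg-query -W -f='${Version}' wmf-mariadb101)
#   running=$(mysql -N -B -e 'SELECT VERSION()')
#   version_drift "$pkg" "$running" && echo "restart would change the running version"
```

Running such a check before a planned reboot would have flagged that the restart was also an unreviewed version change.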

I have reported the crashes at https://jira.mariadb.org/browse/MDEV-20323 . I have saved the syslog to clouddb1001:/home/jynus/syslog.1 in case we need it later for further debugging.

Interestingly, we have labsdb1011 running on 10.1.41, but it doesn't have concurrent writes (it is read only, with only replication).

There seem to be multiple crash bugs reported on JIRA for 10.1 -> 10.3 and 10.1 -> 10.1 upgrades, but I saw no bug fix in the 10.1.42 changelog fitting our description.

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:37:10Z] <arturo> clouddb1002 downgrading wmf-mariadb101 from 10.1.41-1 to 10.1.39-1 (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:45:22Z] <arturo> icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:45:29Z] <arturo> icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:46:36Z] <arturo> (jynus) clouddb1002 mariadb (toolsdb secondary) being upgraded from 10.1.38 to 10.1.39 (T236384 , T236420)

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:49:26Z] <arturo> (jynus) clouddb1002 mariadb (toolsdb secondary) being upgraded from 10.1.38 to 10.1.39 is done !(T236384 , T236420)

bd808 lowered the priority of this task from High to Medium. Oct 25 2019, 5:00 PM

Lowering priority. Things are looking "normal" following the downgrade back to a known stable MariaDB version.

jcrespo mentioned this in Unknown Object (Task). Nov 6 2019, 9:56 AM

There is a new 10.1.43 version which we believe might have fixed this issue. Would it be possible to upgrade the tools host that crashed to this new version?
This would also help us evaluate whether we should get our s1 (enwiki) master upgraded to 10.1.43 (as we are planning to fail it over next Thursday).

taavi subscribed.

Closing this years-old ticket. The current plan is to upgrade to MariaDB 10.4.