A planned reboot on 2019-10-24, related to T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC), unexpectedly resulted in the installation of a different MariaDB package on the instance acting as the primary server for ToolsDB. This version of MariaDB appears to be highly unstable under our "normal" workloads.
|Open||None||T236420 ToolsDB unstable following unplanned software upgrade|
|Resolved||aborrero||T236384 Toolsdb: prevent unattended-upgrades from upgrading mariadb|
|Open||None||T236399 Upgrade mariadb on toolsdb servers to 10.1.44|
|Resolved||bd808||T236423 Restore db write functionality for https://tools.wmflabs.org/deadlinks/api/|
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T08:13:02Z] <arturo> enable puppet in clouddb1001/clouddb1002 to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/546102 (T236384 , T236420)
It seems the secondary is using the wrong mariadb version:
aborrero@clouddb1002:~$ apt-cache policy wmf-mariadb101
wmf-mariadb101:
  Installed: 10.1.41-1
  Candidate: 10.1.41-1
  Version table:
 *** 10.1.41-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
aborrero@clouddb1001:~$ apt-cache policy wmf-mariadb101
wmf-mariadb101:
  Installed: 10.1.39-1
  Candidate: 10.1.39-1
  Version table:
     10.1.41-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
 *** 10.1.39-1 1002
        100 /var/lib/dpkg/status
I guess this is something we would like to fix. Not doing this myself until I get more people involved.
Interestingly enough, grafana reports that clouddb1002 is running 10.1.38, but 10.1.41 is installed, as reported in the previous comment. There is no mention of 10.1.39 on clouddb1002.
I wonder if what happened here is that we upgraded two versions without noticing: 10.1.38 to 10.1.39 to 10.1.41, and after the reboot we moved directly from .38 to .41.
Yes, that is "expected": the installed package version and the running version can differ, because wmf-mariadb101 doesn't stop the server while it is running. The running server should keep working well for the most part, even if the package is uninstalled. That it "works" doesn't mean that it is desirable.
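For reference, a minimal sketch of how this kind of mismatch between the installed package and the running server could be detected. It assumes the `apt-cache policy` output is available as text and that the running version is obtained separately (for example from `SELECT VERSION()`, in a form like `10.1.38-MariaDB`). The function names are hypothetical and not part of any existing tooling.

```python
import re
from typing import Optional


def installed_version(policy_output: str) -> Optional[str]:
    """Return the 'Installed:' version from `apt-cache policy` output, or None."""
    match = re.search(r"Installed:\s*(\S+)", policy_output)
    if match is None or match.group(1) == "(none)":
        return None
    return match.group(1)


def upstream_version(version: str) -> str:
    """Keep only the leading dotted number, e.g. '10.1.41-1' -> '10.1.41'."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return match.group(0)


def versions_differ(policy_output: str, running_version: str) -> bool:
    """True if the installed package and the running server disagree.

    A missing package is treated as a mismatch, since the server can keep
    running after the package has been removed.
    """
    installed = installed_version(policy_output)
    if installed is None:
        return True
    return upstream_version(installed) != upstream_version(running_version)
```

With the clouddb1002 output above and a running version of `10.1.38-MariaDB`, `versions_differ` would flag the host as needing a restart (or a downgrade) before the next unplanned reboot does it for us.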
I have reported the crashes at https://jira.mariadb.org/browse/MDEV-20323 . I have saved the syslog to clouddb1001:/home/jynus/syslog.1 in case we need it later for further debugging.
Interestingly, we have labsdb1011 running on 10.1.41, but it doesn't have concurrent writes (it is read only, with only replication).
There seem to be multiple crash bugs for 10.1 -> 10.3 and 10.1 -> 10.1 upgrades on JIRA, but I saw no bug fix in the 10.1.42 changelog fitting our description.
There is a new 10.1.43 version which we believe might have fixed this issue. Would it be possible to upgrade the tools host that crashed to this new version?
This would also help us evaluate whether we should get our s1 (enwiki) master upgraded to 10.1.43 (as we are planning to fail it over next Thursday).