A planned reboot related to T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) on 2019-10-24 unexpectedly resulted in the install of a different MariaDB package on the instance acting as the primary server for ToolsDB. This version of MariaDB appears to be highly unstable under our "normal" workloads.
Description
Status | Assigned | Task
---|---|---
Resolved | None | T236420 ToolsDB unstable following unplanned software upgrade
Resolved | aborrero | T236384 Toolsdb: prevent unattended-upgrades from upgrading mariadb
Declined | None | T236399 Upgrade mariadb on toolsdb servers to 10.1.44
Resolved | bd808 | T236423 Restore db write functionality for https://tools.wmflabs.org/deadlinks/api/
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2019-10-24T18:34:51Z] <bstorm_> downgraded clouddb1001 to 10.1.39 T236420 T236384
At this point, we have some aggressive query killing going on as well. The downgrade might be a better bet.
It seems stable at this point with query killing off. The downgrade seems to be the ticket.
Should we close T236399: Upgrade mariadb on toolsdb servers to 10.1.44 as WONTFIX and fix the pinning instead?
Perhaps we should at least drop it to low priority? We should probably use 10.1.42 after testing in prod.
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T08:13:02Z] <arturo> enable puppet in clouddb1001/clouddb1002 to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/546102 (T236384 , T236420)
It seems the secondary is using the wrong mariadb version:
    aborrero@clouddb1002:~$ apt-cache policy wmf-mariadb101
    wmf-mariadb101:
      Installed: 10.1.41-1
      Candidate: 10.1.41-1
      Version table:
     *** 10.1.41-1 1001
           1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
            100 /var/lib/dpkg/status
vs
    aborrero@clouddb1001:~$ apt-cache policy wmf-mariadb101
    wmf-mariadb101:
      Installed: 10.1.39-1
      Candidate: 10.1.39-1
      Version table:
         10.1.41-1 1001
           1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
     *** 10.1.39-1 1002
            100 /var/lib/dpkg/status
I guess this is something we would like to fix. Not doing this myself until I get more people involved.
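For anyone who wants to script this comparison rather than eyeball it, here is a minimal sketch. It assumes the `apt-cache policy` output has already been collected from each host as text; `parse_policy` and `compare_hosts` are hypothetical helper names, not existing tooling.

```python
import re


def parse_policy(text):
    """Extract the Installed and Candidate versions from `apt-cache policy` output."""
    fields = {}
    for key in ("Installed", "Candidate"):
        match = re.search(rf"{key}:\s*(\S+)", text)
        fields[key.lower()] = match.group(1) if match else None
    return fields


def compare_hosts(policy_by_host):
    """Return each host's installed version and whether all hosts agree."""
    installed = {
        host: parse_policy(output)["installed"]
        for host, output in policy_by_host.items()
    }
    return installed, len(set(installed.values())) == 1


# Sample data: the Installed/Candidate lines quoted in the comment above.
SAMPLE = {
    "clouddb1001": "wmf-mariadb101: Installed: 10.1.39-1 Candidate: 10.1.39-1",
    "clouddb1002": "wmf-mariadb101: Installed: 10.1.41-1 Candidate: 10.1.41-1",
}

if __name__ == "__main__":
    versions, consistent = compare_hosts(SAMPLE)
    print(versions, "consistent" if consistent else "MISMATCH")
```

With the sample data above, the two hosts are reported as a mismatch, which is the situation described in this comment.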
Interestingly enough, grafana reports that clouddb1002 is running 10.1.38, but 10.1.41 is installed, as reported in the previous comment. There is no mention of 10.1.39 on clouddb1002.
I wonder if what happened here is that the package was upgraded twice without us noticing (10.1.38 to 10.1.39 to 10.1.41), and after the reboot the running server moved directly from .38 to .41.
Yes, that is "expected": the installed package version and the running version can differ, because upgrading wmf-mariadb101 doesn't stop the server while it is running. The running server should keep working well for the most part, even if the package is uninstalled. That it "works" doesn't mean that it is desirable.
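A small check along these lines could flag a host whose running server is older or newer than the installed package, so the jump does not first show up at reboot. This is only a sketch: it assumes the running version is a string such as `10.1.38-MariaDB` (the usual shape of `SELECT VERSION()` output) and the package version is a string such as `10.1.41-1`; `upstream_version` and `restart_pending` are hypothetical names.

```python
import re


def upstream_version(version_string):
    """Return the leading x.y.z of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def restart_pending(running, installed):
    """True when the running server and the installed package differ in x.y.z."""
    return upstream_version(running) != upstream_version(installed)


if __name__ == "__main__":
    # Values reported for clouddb1002 in this thread: grafana showed 10.1.38
    # running while the installed package was 10.1.41-1.
    print(restart_pending("10.1.38-MariaDB", "10.1.41-1"))
```

Comparing only the leading x.y.z deliberately ignores the Debian revision (`-1`) and any server suffix, since those never match between the two sources.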
I have reported the crashes at https://jira.mariadb.org/browse/MDEV-20323 . I have saved the syslog to clouddb1001:/home/jynus/syslog.1 in case we need it later for further debugging.
Interestingly, we have labsdb1011 running on 10.1.41, but it doesn't have concurrent writes (it is read only, with only replication).
There seem to be multiple crash bugs reported on JIRA for 10.1 -> 10.3 and 10.1 -> 10.1 upgrades, but I saw no bug fix in the 10.1.42 changelog fitting our description.
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:37:10Z] <arturo> clouddb1002 downgrading wmf-mariadb101 from 10.1.41-1 to 10.1.39-1 (T236384 , T236420)
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:45:22Z] <arturo> icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384 , T236420)
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:45:29Z] <arturo> icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384 , T236420)
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:46:36Z] <arturo> (jynus) clouddb1002 mariadb (toolsdb secondary) being upgraded from 10.1.38 to 10.1.39 (T236384 , T236420)
Mentioned in SAL (#wikimedia-cloud) [2019-10-25T10:49:26Z] <arturo> (jynus) clouddb1002 mariadb (toolsdb secondary) being upgraded from 10.1.38 to 10.1.39 is done !(T236384 , T236420)
Lowering priority. Things are looking "normal" following the downgrade back to a known stable MariaDB version.
There is a new 10.1.43 version which we believe might have fixed this issue. Would it be possible to upgrade the tools host that crashed to this new version?
This would also help us evaluate whether we should get our s1 (enwiki) master upgraded to 10.1.43 (as we are planning to fail it over next Thursday).