db1097 (m1 master) crashed due to memory issues.
Closed, Resolved, Public

Description

db1097 crashed due to memory errors and rebooted itself:

	properties
		CreationTimestamp = 20200630051510.000000-300
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
		RecordFormat = string Description
		RecordID = 15
		CreationTimestamp = 20200630051510.000000-300
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
		RecordFormat = string Description
		RecordID = 13
		CreationTimestamp = 20200630051510.000000-300
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
		RecordFormat = string Description
		RecordID = 12
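
To pull just the affected DIMM slots out of a System Event Log dump like the one above, something along these lines may help. It is a minimal sketch: the function name `sel_failed_dimms` is hypothetical, and it assumes the dump arrives on standard input in the same `RecordData = ...` layout shown here.

```shell
# Hypothetical helper (not part of any existing tooling): print the unique
# DIMM slots named in "RecordData" lines read from stdin.
# Assumes slot names look like DIMM_A1, DIMM_B1, ... as in the log above.
sel_failed_dimms() {
    grep 'RecordData' \
        | grep -o 'DIMM_[A-Z][0-9]*' \
        | sort -u
}
```

Fed the three records above, this would list DIMM_A1, DIMM_A3 and DIMM_B1, one per line.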

Times in UTC

[06:16:29]  <+icinga-wm>	PROBLEM - Host db1097 is DOWN: PING CRITICAL - Packet loss = 100%
[06:23:53]  <+icinga-wm>	RECOVERY - Host db1097 is UP: PING WARNING - Packet loss = 50%, RTA = 0.25 ms
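
For a rough idea of how long the host was unreachable, the gap between the two Icinga messages can be computed as below. This is only a sketch: `hms_diff` is a hypothetical helper, it assumes both timestamps fall on the same day, and because Icinga checks are periodic the result is the interval between alerts rather than the exact downtime.

```shell
# Hypothetical helper: seconds between two same-day HH:MM:SS timestamps.
# awk treats "06" as the number 6, so leading zeros are harmless here.
hms_diff() {
    printf '%s %s\n' "$1" "$2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
    }'
}

hms_diff 06:16:29 06:23:53   # prints 444, i.e. 7 minutes 24 seconds
```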

Multiple memory errors were logged. This host will be replaced next FY, so it is probably not worth buying parts for it; we can just replace it with db1080.
The crash required an etherpad reload.

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.
Marostegui updated the task description.

I was in the process of moving db1080 to m2, but I will move it to m1 instead so we can replace and decommission this host.

Change 608543 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] site.pp: Move db1080 to m1 instead of m2

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608543

Change 608543 merged by Marostegui:
[operations/puppet@production] site.pp: Move db1080 to m1 instead of m2

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608543

Mentioned in SAL (#wikimedia-operations) [2020-06-30T08:05:52Z] <marostegui> Stop MySQL on db1117:3322 to clone db1080 (this will trigger haproxy alerts) - T256717

Change 608580 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1080: Enable notifications

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608580

db1080 is ready. Now we just need to schedule another m1 failover to promote db1080 to master.

Change 608580 merged by Marostegui:
[operations/puppet@production] db1080: Enable notifications

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608580

Actually I just realised that this host won't be replaced next FY, as we are replacing up to db1095.

@akosiaris @jcrespo let's replace this master on Wednesday at 08:00 AM UTC?

Change 610010 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1080 to m1 master

https://gerrit.wikimedia.org/r/610010

Failover procedure:

OLD MASTER: db1097

NEW MASTER: db1080

  • Check configuration differences between new and old master

$ pt-config-diff h=db1097.eqiad.wmnet,F=/root/.my.cnf h=db1080.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1080

switchover.py --timeout=1 --only-slave-move db1097.eqiad.wmnet db1080.eqiad.wmnet

  • Disable puppet on db1097 and db1080

puppet agent --disable "switchover to db1080"

  • Merge gerrit: https://gerrit.wikimedia.org/r/610010
  • Run puppet on dbproxy1012 (active) and dbproxy1014 and check the config

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m1 from db1097 to db1080 - T256717
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# ./switchover.py --skip-slave-move db1097 db1080
  • Reload haproxies

dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1097)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1097 and db1080

puppet agent --enable && puppet agent -tv

  • Check services affected: etherpad, bacula, librenms, rt
  • Update/resolve phabricator ticket about failover
  • Create decommissioning ticket for db1097: T257406
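
The "Reload haproxies" step pipes `show stat` to the HAProxy socket, which returns a wide CSV table that is hard to read by eye. A small filter like the following could reduce it to one line per backend server. This is a sketch under assumptions: `haproxy_status` is a hypothetical helper, and it relies on the CSV header line starting with `# ` and containing `pxname`, `svname` and `status` columns, which it looks up by name rather than by position.

```shell
# Hypothetical helper: summarise HAProxy "show stat" CSV (read from stdin)
# as "proxy/server status". Columns are located via the "# ..." header line.
haproxy_status() {
    awk -F, '
        /^# / {
            sub(/^# /, "", $1)                        # strip the header marker
            for (i = 1; i <= NF; i++) col[$i] = i     # map column name to index
            next
        }
        NF > 1 && ("status" in col) {
            print $(col["pxname"]) "/" $(col["svname"]) " " $(col["status"])
        }
    '
}
```

It would be used on each proxy after the reload, for example:

    echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status

Looking columns up by name keeps the filter working if the column order differs between HAProxy versions.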

Mentioned in SAL (#wikimedia-operations) [2020-07-08T06:47:42Z] <marostegui> start topology changes on m1 T256717

Change 610010 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1080 to m1 master

https://gerrit.wikimedia.org/r/610010

Mentioned in SAL (#wikimedia-operations) [2020-07-08T07:48:54Z] <jynus> stop bacula-director on backup1001 in preparation for m1 switchover T256717

Mentioned in SAL (#wikimedia-operations) [2020-07-08T08:00:53Z] <marostegui> Failover m1 from db1097 to db1080 - T256717

Change 610236 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1097: Disable notifications

https://gerrit.wikimedia.org/r/610236

Change 610236 merged by Marostegui:
[operations/puppet@production] db1097: Disable notifications

https://gerrit.wikimedia.org/r/610236

Marostegui claimed this task.

All done - the decommissioning of db1097 will be tracked at T257406.
Thanks Jaime and Alex for supporting this maintenance!