
Decommission db1052
Closed, ResolvedPublic

Description

db1052 was the old s1 master, which has been failed over to db1067 (T197069).
Let's wait a few days before handing it over to DCOps for full decommissioning.

  • Once the network maintenance is done remove db1089 as candidate master for s1 (T197069#4418823)
  • Take a snapshot of /srv/sqldata and place it somewhere (dbstore1001 is a good candidate) (T199861#4468052)

Decommission Checklist

START NON-INTERRUPTIBLE STEPS - please assign to @RobH for the non-interrupt steps

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

Marostegui triaged this task as Normal priority. Jul 18 2018, 6:36 AM
Marostegui created this task.
Restricted Application added a subscriber: Aklapper. Jul 18 2018, 6:36 AM

I thought a bit about how to go about this, and given the importance and history of this host, here is one proposal; see what you think:

  • Wait 1 week to make sure we are not going to fail back immediately
  • Archive a compressed tarball on the database hosts, just in case, for e.g. 3 months

I thought a bit about how to go about this, and given the importance and history of this host, here is one proposal; see what you think:

  • Wait 1 week to make sure we are not going to fail back immediately

Yeah, my idea was to even wait till 31st July - after the network maintenance.

  • Archive a compressed tarball on the database hosts, just in case, for e.g. 3 months

Agreed!
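
For illustration only, the "compressed tarball" idea could be sketched in Python roughly as below. This is not the procedure used on this task: the function name `archive_datadir`, the archive file name, and the throwaway demo directory are all hypothetical, and the sketch assumes MySQL has already been stopped so the data directory is not changing while it is read.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_datadir(src, dest):
    """Write a gzip-compressed tarball of the directory `src` to `dest`.

    Assumes the database server is stopped, so files are consistent on disk.
    """
    src = Path(src)
    with tarfile.open(dest, "w:gz") as tar:
        # Store paths relative to the directory name, not the absolute path.
        tar.add(src, arcname=src.name)
    return Path(dest)


# Demo on a temporary directory standing in for a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    datadir = Path(tmp) / "sqldata"          # hypothetical stand-in
    datadir.mkdir()
    (datadir / "ibdata1").write_bytes(b"placeholder")

    archive = archive_datadir(datadir, Path(tmp) / "db1052-sqldata.tar.gz")
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
```

On a real host of this size, a streaming copy tool would likely be preferable to a single in-process tarball; the sketch only shows the shape of the step.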

Change 446533 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1052: Disable notifications, upgrade socket

https://gerrit.wikimedia.org/r/446533

Marostegui moved this task from Triage to Next on the DBA board.

Change 446533 merged by Marostegui:
[operations/puppet@production] db1052: Disable notifications, upgrade socket

https://gerrit.wikimedia.org/r/446533

Change 449652 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1089: Change binlog to ROW

https://gerrit.wikimedia.org/r/449652

Change 449652 merged by Marostegui:
[operations/puppet@production] db1089: Change binlog to ROW

https://gerrit.wikimedia.org/r/449652

Mentioned in SAL (#wikimedia-operations) [2018-08-01T04:47:51Z] <marostegui> Stop MySQL on db1052 to copy its content to dbstore1001 - https://phabricator.wikimedia.org/T199861

Change 449653 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/449653

Change 449653 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/449653

db1052's content has been copied to dbstore1001:/srv/backups/tmp/db1052
For the record, these are the coordinates after the stop:

root@PRODUCTION s1 master[(none)]> show slave status\G show master status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: db1067.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db1067-bin.001325
          Read_Master_Log_Pos: 712219114
               Relay_Log_File: db1052-relay-bin.000158
                Relay_Log_Pos: 712219402
        Relay_Master_Log_File: db1067-bin.001325
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 712219114
              Relay_Log_Space: 712219744
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974720
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171970637-5484646134,171974720-171974720-88503795,171970637-171970637-2116621969,171978774-171978774-5,180359172-180359172-49702203
1 row in set (0.00 sec)

*************************** 1. row ***************************
            File: db1052-bin.005999
        Position: 323268603
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
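
As a rough sketch of how coordinates like the ones above could be recorded programmatically, the Python below parses vertical `key: value` output into a dictionary and splits a MariaDB GTID list into (domain, server, sequence) tuples. The helper names `parse_vertical` and `parse_gtid_list` are made up for this example, and the sample is a shortened excerpt of the output above, not the full status.

```python
def parse_vertical(text):
    """Parse `key: value` lines from vertical client output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the "*** 1. row ***" separators, and non-field lines.
        if not line or line.startswith("*") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def parse_gtid_list(value):
    """Split a MariaDB GTID list into (domain_id, server_id, sequence) tuples."""
    gtids = []
    for item in value.split(","):
        domain, server, seq = item.strip().split("-")
        gtids.append((int(domain), int(server), int(seq)))
    return gtids


# Shortened excerpt of the status shown above.
sample = """\
Master_Log_File: db1067-bin.001325
Read_Master_Log_Pos: 712219114
Relay_Master_Log_File: db1067-bin.001325
Exec_Master_Log_Pos: 712219114
Gtid_IO_Pos: 0-171970637-5484646134,171974720-171974720-88503795
"""

status = parse_vertical(sample)

# The replica had applied everything it fetched if the read and executed
# coordinates point at the same file and position.
caught_up = (
    status["Master_Log_File"] == status["Relay_Master_Log_File"]
    and status["Read_Master_Log_Pos"] == status["Exec_Master_Log_Pos"]
)
gtids = parse_gtid_list(status["Gtid_IO_Pos"])
```

With the values recorded on this task, the read and executed positions are both 712219114 in db1067-bin.001325, so `caught_up` evaluates to true for this excerpt.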

Change 449665 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db1052

https://gerrit.wikimedia.org/r/449665

Mentioned in SAL (#wikimedia-operations) [2018-08-01T07:09:48Z] <marostegui> Remove db1052 from tendril - T199861

Change 449665 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db1052

https://gerrit.wikimedia.org/r/449665

Mentioned in SAL (#wikimedia-operations) [2018-08-01T07:11:32Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db1052 from config as it will be decommissioned - T199861 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2018-08-01T07:12:35Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Remove db1052 from config as it will be decommissioned - T199861 (duration: 00m 55s)

Change 449666 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Set db1052 to spare

https://gerrit.wikimedia.org/r/449666

Change 449666 merged by Marostegui:
[operations/puppet@production] mariadb: Set db1052 to spare

https://gerrit.wikimedia.org/r/449666

Marostegui moved this task from Next to Done on the DBA board.

db1052 is now ready for DCOps to finish its decommissioning - assigning it to @RobH
db1052 was a great s1 master but now it needs some rest!! :-)

Restricted Application added a project: Operations. Aug 1 2018, 7:23 AM
Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Aug 1 2018, 2:32 PM

Change 452385 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Removing db1052 from site.pp final decommission

https://gerrit.wikimedia.org/r/452385

Change 452385 merged by Cmjohnson:
[operations/puppet@production] Removing db1052 from site.pp final decommission

https://gerrit.wikimedia.org/r/452385

Change 452394 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing dns entries decom host db1052

https://gerrit.wikimedia.org/r/452394

Change 452394 merged by Cmjohnson:
[operations/dns@master] Removing dns entries decom host db1052

https://gerrit.wikimedia.org/r/452394

Cmjohnson updated the task description. Aug 13 2018, 3:18 PM
Cmjohnson moved this task from Decommission to UnRacking Tasks on the ops-eqiad board.
Cmjohnson closed this task as Resolved. Aug 21 2018, 5:05 PM
Cmjohnson updated the task description.