
install new disks into dbstore2001
Closed, Resolved · Public

Description

This task can be used to track the installation of the new disks (ordered on parent task T143874) into existing dbstore2001. Please note these will replace existing disks, so the old disks will need to be wiped. If they are not wiped while installed in dbstore2001, they'll have to be installed into a spare machine and wiped from it. (It would be easier on @Papaul to wipe the disks while they are installed into dbstore2001, but that is likely an overnight downtime and not acceptable.)

  • Work with @Papaul, @jcrespo, and @Marostegui on when dbstore2001 can be depooled for disk removal.
  • Check whether the existing/old disks can be wiped while installed in dbstore2001. If not, move the disks to a spare machine and wipe them there.
  • Install the new 2TB disks into dbstore2001.
  • Follow the directions on server lifecycle reinstallation. This covers removing the host from monitoring and reinstalling the system. The script may not be usable because of the length of the downtime (powering down, replacing disks, and powering back up), so it will likely need to be stepped through manually. (This includes removing the old puppet/salt acceptance and then re-adding it post installation.)

Event Timeline

RobH created this task. Oct 28 2016, 10:26 PM
Restricted Application added subscribers: Southparkfan, Aklapper. · Oct 28 2016, 10:26 PM
jcrespo removed Papaul as the assignee of this task. Oct 29 2016, 9:52 AM

Before touching dbstore2001, I would suggest transferring its raw contents to dbstore2002 and labsdb1008. If possible, stop using labsdb1008 as a test host and move it to production as an interim sanitarium.

Check that the contents are OK and that replication continues without problems; then assign this to Papaul, wipe dbstore2001, shut it down, and remove the disks (can they be reused? I do not need them, but maybe they are useful as replacements for something else). Install the new disks and reimage. We will copy the contents back, then continue with T146261 now that the host has enough capacity.

@Papaul we will claim this task until it is ready for you to do the wipe, install and reimage.

@Marostegui What do you think of the plan?

I am fine with that.
What I want to do:

  • Move the snapshot from dbstore2001 to dbstore2002 and labsdb1008 (needs coordination with Chase).
  • Build dbstore2002 from there (at least with 3 shards, s1,s3,s4) and once it is ready, reimage dbstore2001 with the new disks.

I don't want to be without any dbstores for some time (right now dbstore2002 isn't too reliable, but at least it is there).

@jcrespo are you OK if I delete all the content of dbstore2002:/srv/sqldata today, move the snapshot from dbstore2001, and start dbstore2002? Once that is done, we can continue with dbstore2001 and get it the new disks.

Mentioned in SAL (#wikimedia-operations) [2016-11-02T08:43:33Z] <marostegui> Stopping mysql dbstore2002 for maintenance - T149457

The transfer is now happening between dbstore2001 and dbstore2002.

dbstore2002 is now up and running with the data from dbstore2001. I would like to leave it running for a few days to make sure it doesn't crash or show anything weird before we kill dbstore2001.
Before killing dbstore2001 I will take a last snapshot from it.

@Papaul you can ignore this ticket until I ping you again once we are ready to change disks in dbstore2001.

root@dbstore2002:/srv/sqldata# mysql --skip-ssl -e "nopager;show all slaves status\G"
PAGER set to stdout
*************************** 1. row ***************************
              Connection_name: s1
              Slave_SQL_State: updating
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db2016.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db2016-bin.002501
          Read_Master_Log_Pos: 292987194
               Relay_Log_File: dbstore2002-relay-bin-s1.000002
                Relay_Log_Pos: 22017401
        Relay_Master_Log_File: db2016-bin.002495
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table: %wik%.%,heartbeat.%
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 411251042
              Relay_Log_Space: 6195225588
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 178349
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180359172
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
         Retried_transactions: 0
           Max_relay_log_size: 1073741824
         Executed_log_entries: 144231
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-171970589-2642577996
*************************** 2. row ***************************
              Connection_name: s3
              Slave_SQL_State: update
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db2018.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db2018-bin.002509
          Read_Master_Log_Pos: 200522129
               Relay_Log_File: dbstore2002-relay-bin-s3.000002
                Relay_Log_Pos: 62146894
        Relay_Master_Log_File: db2018-bin.002501
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table: %wik%.%,heartbeat.%
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 526794153
              Relay_Log_Space: 8124698229
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 177605
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180359174
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
         Retried_transactions: 0
           Max_relay_log_size: 1073741824
         Executed_log_entries: 399564
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-171970589-2642577996
*************************** 3. row ***************************
              Connection_name: s4
              Slave_SQL_State: update
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db2019.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db2019-bin.001925
          Read_Master_Log_Pos: 197546301
               Relay_Log_File: dbstore2002-relay-bin-s4.000002
                Relay_Log_Pos: 115536510
        Relay_Master_Log_File: db2019-bin.001916
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table: %wik%.%,heartbeat.%
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 699804217
              Relay_Log_Space: 9050472244
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 176879
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180359175
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
         Retried_transactions: 0
           Max_relay_log_size: 1073741824
         Executed_log_entries: 667408
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-171970589-2642577996
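
For readers skimming the status dump above, the per-shard lag is the interesting part. The following is a minimal sketch, not something that was run on the host: it assumes the vertical (`\G`) output has been captured as text, and the function names `parse_vertical_status` and `lag_report` are hypothetical. The sample holds only the `Connection_name` and `Seconds_Behind_Master` values pasted above.

```python
import re

# Matches separator lines such as "*********** 1. row ***********".
ROW_SEPARATOR = re.compile(r"^\*+ \d+\. row \*+$")


def parse_vertical_status(text):
    """Split vertical-format status output into one dict per connection."""
    rows = []
    current = None
    for line in text.splitlines():
        if ROW_SEPARATOR.match(line.strip()):
            current = {}
            rows.append(current)
            continue
        # Skip anything before the first row (e.g. "PAGER set to stdout").
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return rows


def lag_report(rows):
    """Return (connection, lag in seconds, lag in days) for each row."""
    report = []
    for row in rows:
        raw = row.get("Seconds_Behind_Master", "NULL")
        if raw in ("", "NULL"):
            # The server reports NULL when replication is not running.
            report.append((row.get("Connection_name", ""), None, None))
            continue
        seconds = int(raw)
        report.append((row["Connection_name"], seconds, round(seconds / 86400, 2)))
    return report


SAMPLE = """\
PAGER set to stdout
*************************** 1. row ***************************
              Connection_name: s1
        Seconds_Behind_Master: 178349
*************************** 2. row ***************************
              Connection_name: s3
        Seconds_Behind_Master: 177605
*************************** 3. row ***************************
              Connection_name: s4
        Seconds_Behind_Master: 176879
"""

for name, seconds, days in lag_report(parse_vertical_status(SAMPLE)):
    print(f"{name}: {seconds} s behind (~{days} days)")
```

All three shards were roughly two days behind at this point, which fits a host restored from a snapshot and still catching up.
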
Marostegui moved this task from Triage to In progress on the DBA board. Nov 3 2016, 9:17 AM

dbstore2002 caught up. \o/

I am going to do a few tests to make sure it is fine and if so, on Sunday I will take a final snapshot of dbstore2001, move it to dbstore2002 and we can probably go ahead and change disks on dbstore2001.
I will update this ticket on Monday

dbstore2002 looks good: stopping and starting the slaves, the mysqld process, and so forth shows no errors.

I have seen that the TokuDB plugin cannot be loaded:

161104 11:15:59 [ERROR] TokuDB unknown error -100005
161104 11:15:59 [ERROR] Plugin 'TokuDB' init function returned error.
161104 11:15:59 [ERROR] Plugin 'TokuDB' registration as a STORAGE ENGINE failed.

But that might be related to the package version (10.0.22) and the fact that all the data was copied from dbstore2001 (10.0.27).
We cannot upgrade dbstore2002 to 10.0.27 as it is running Trusty.

TokuDB works fine on dbstore2001 and loaded fine there too, so I am not too worried about this issue on dbstore2002. What do you think, @jcrespo?

@Papaul can we do this on Thursday?

On Wednesday night I will take a snapshot of dbstore2001, so we should be good to go on Thursday.

I have been talking to @Volans as I was not sure if we needed the host to be on for the script to work. But we believe it will work fine.
So you can just change the disks and leave the server off and we will take it from there.

Let me know if Thursday works for you.
Thanks!

Papaul added a comment. Nov 8 2016, 5:13 PM

@Marostegui yes Thursday 10:00 am works for me.

Great, thank you! I will wait for you, and once you are around I will shut down the server.
Thanks!

Mentioned in SAL (#wikimedia-operations) [2016-11-09T18:47:30Z] <marostegui> Stopping MySQL dbstore2001 - taking a snapshot - T149457

The snapshots are taken, so dbstore2001 is ready to get the new disks later today.

I have placed them at dbstore2002:/srv/tmp. There is one from 7th Nov and another one from last night.

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201611101638_marostegui_16094.log.

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201611101652_marostegui_25467.log.

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201611101656_marostegui_29639.log.

The reason for so many script runs is that the server doesn't get reimaged if it is not powered on:

Set Chassis Power Control to Cycle failed: Command not supported in present state

After talking to @Volans we decided to power the server on manually from the ILO and launch the script (with the new option --new) as soon as the ILO said that the server was on.
This time the script was able to go through instead of getting stuck on "Set Boot Device to pxe".

Ricciardo is already looking into ways of solving this issue.

The issue with wmf-reimage is tracked in T150448
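
The failure above comes down to asking for a power cycle while the chassis is off. As a rough sketch of one way to avoid that (not how wmf_auto_reimage works, and not the fix tracked in T150448), the power state can be checked first and the action chosen accordingly. It assumes ipmitool is available and that `chassis power status` prints "Chassis Power is on" or "Chassis Power is off"; the function names, the `root` user, and the management hostname are hypothetical.

```python
import subprocess


def choose_power_action(status_output):
    """Pick a chassis power action from `chassis power status` output."""
    text = status_output.strip().lower()
    if text.endswith("is off"):
        # A power cycle is rejected while the chassis is off, so power on instead.
        return "on"
    if text.endswith("is on"):
        return "cycle"
    raise ValueError(f"unrecognised power status: {status_output!r}")


def ipmi_command(mgmt_host, *args):
    """Build an ipmitool command line; -E reads the password from IPMI_PASSWORD."""
    # The user name "root" is an assumption for illustration only.
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", "root", "-E", *args]


def power_up_for_reimage(mgmt_host):
    """Query the power state, then power on or cycle as appropriate."""
    status = subprocess.run(
        ipmi_command(mgmt_host, "chassis", "power", "status"),
        check=True, capture_output=True, text=True,
    ).stdout
    action = choose_power_action(status)
    subprocess.run(ipmi_command(mgmt_host, "chassis", "power", action), check=True)
    return action


# The decision logic alone, without touching any hardware:
print(choose_power_action("Chassis Power is off"))
print(choose_power_action("Chassis Power is on"))
```

Calling `power_up_for_reimage("dbstore2001.mgmt.codfw.wmnet")` would be the hardware-touching step; that hostname is an assumed example and is not taken from this task.
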

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201611101806_marostegui_8360.log.

Completed auto-reimage of hosts:

['dbstore2001.codfw.wmnet']

and were ALL successful.

The server got reinstalled and looks good:

root@dbstore2001:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:   	Debian GNU/Linux 8.6 (jessie)
Release:       	8.6
Codename:      	jessie

root@dbstore2001:~# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T   34M   11T   1% /srv

root@dbstore2001:~# dpkg -l| grep mariadb
ii  wmf-mariadb10                  10.0.28-1                            amd64        MariaDB plus patches.

I have already started the snapshot transfer to it.

Thanks @Papaul for all your help!

The data copy finished and after running mysql_upgrade I have started replication and the slaves are catching up nicely with the master.

I forgot to include the RAID configuration for the record:

root@dbstore2001:~# hpssacli ctrl all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438032A7F8A0)


   Gen8 ServBP 25+2 at Port 1I, Box 1, OK

   Gen8 ServBP 25+2 at Port 2I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (10.9 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 2 TB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 2 TB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 2 TB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 2 TB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 2 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 2 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 2 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 2 TB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 2 TB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 2 TB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 2 TB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 2 TB, OK)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 25+2) 376  (WWID: 50014380314CCDF3, Port: 2I, Box: 1)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 25+2) 377  (WWID: 50014380314CCDD3, Port: 1I, Box: 1)

   Expander 379  (WWID: 50014380314CCDE0, Port: 2I, Box: 1)

   Expander 380  (WWID: 50014380314CCDC0, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 378  (WWID: 5001438032A7F8AF)
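
As a sanity check on the numbers above, here is a small sketch of the capacity arithmetic. It assumes each drive is exactly 2 × 10^12 bytes (real drives differ slightly) and that the controller and df report sizes in binary units; the function name `raid10_usable_bytes` is hypothetical.

```python
def raid10_usable_bytes(drive_count, drive_bytes):
    """Usable capacity of a RAID 1+0 array built from mirrored pairs."""
    if drive_count < 2 or drive_count % 2:
        raise ValueError("RAID 1+0 needs an even number of drives (at least 2)")
    # Every drive is mirrored once, so only half the drives add capacity.
    return (drive_count // 2) * drive_bytes


usable = raid10_usable_bytes(12, 2 * 10**12)
print(f"{usable / 10**12:.1f} TB decimal, {usable / 2**40:.1f} TiB binary")
```

Twelve 2 TB drives in RAID 1+0 give 12 TB in decimal units, which is about 10.9 in binary units, consistent with the 10.9 TB logical drive shown here and the 11T reported by df.
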

I believe the scope of this ticket is done and can be closed.

@Papaul the disks still need to be wiped, is that something you can do or something we have to do?

I will leave this ticket open until you let us know.

Thanks

dbstore2001 caught up without any problems.
This ticket is ready to be closed as soon as @Papaul confirms that the old disks are wiped.

Papaul closed this task as Resolved. Nov 21 2016, 3:43 PM

The old disks are wiped. Good to close this task.