dbstore1004 85% disk space used.
Closed, Resolved (Public)

Assigned To

Authored By

	Marostegui
	May 19 2021, 5:03 AM

Description

dbstore1004 is the analytics host that holds s2, s3 and s4.
It is now at 85% disk usage, mainly because of s4 (which is already compressed).
There's not much else we can do there apart from replacing this host.

My understanding is that Analytics didn't budget for having these hosts (dbstore*) expanded, so this requires some reorganization of our databases to find a host to replace this one with.

We can do the data transfer and the database reorganization, but Analytics would need to handle everything else on their side.

Details

Subject	Repo	Branch	Lines +/-
Add dbstore1006 to analytics vlan	operations/homer/public	master	+2 -0
Fix the dbstore1007 IP after changing VLAN in analytics-in4	operations/homer/public	master	+1 -1
Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones	operations/dns	master	+6 -6
dbstore1007: Enable notifications	operations/puppet	production	+0 -1
install_server: Do not format /srv on dbstore1007	operations/puppet	production	+1 -2
Add dbstore1007 to analytics firewall	operations/homer/public	master	+2 -0
site: remove dbstore1006 from mariadb role	operations/puppet	production	+1 -1
site: give mariadb role to dbstore1007	operations/puppet	production	+6 -14
site.pp: Fix dbstore1007 role	operations/puppet	production	+1 -1
site: add dbstore1007, reimaged from db1183	operations/puppet	production	+10 -5
dbstore1004: Disable notifications	operations/puppet	production	+1 -0
install_server: Do not format /srv on dbstore1006	operations/puppet	production	+1 -2
site: add role for dbstore1006	operations/puppet	production	+1 -5
netboot: Change dbstore1006 netboot.cfg to partman/custom/db.cfg	operations/puppet	production	+2 -1
site: configure dbstore1006 as insetup	operations/puppet	production	+14 -0
db1125: decommission db1125	operations/puppet	production	+2 -10

Related Objects

Status	Assigned	Task
Resolved	• razzi	T283125 dbstore1004 85% disk space used.
Resolved	• Kormat	T284128 Re-image (rename) dbstore1006 into db1125
Resolved	• Kormat	T284622 Rename dbstore1004 to db1183 and place it on m5
Resolved	Jclark-ctr	T286468 Relabel dbstore1004 to db1183

Event Timeline


Change 693224 merged by Razzi:

[operations/puppet@production] netboot: Change dbstore1006 netboot.cfg to partman/custom/db.cfg

https://gerrit.wikimedia.org/r/693224

Change 693230 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: add role for dbstore1006

https://gerrit.wikimedia.org/r/693230

Change 693230 merged by Razzi:

[operations/puppet@production] site: add role for dbstore1006

https://gerrit.wikimedia.org/r/693230

Hmm, the machine has been renamed and is almost operational, but doesn't have ssh keys so I can't log in. I'm not sure what to do at this point, I tried adding the mariadb::dbstore_multiinstance role in case the problem was that it didn't have a role, but that didn't do anything. If you know what to do from here @Marostegui feel free to go for it; otherwise I'll ask around SRE tomorrow.

@razzi I have fixed the issue and the host is now accessible. The certificate wasn't signed by puppet, so I have signed it manually - did you get any error during the installation?
However, I would prefer that you attempt the full reimage again to make sure the host reimages cleanly without any manual steps; otherwise something may still be wrong somewhere and resurface in future reimages.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['dbstore1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105210505_marostegui_18888.log.

Completed auto-reimage of hosts:

['dbstore1006.eqiad.wmnet']

and were ALL successful.

@razzi all went fine this time, do you remember if you used the --new thing? Maybe that was the issue.
Anyways, I have executed:

root@dbstore1006:~# sudo lvextend -L+1100G /dev/mapper/tank-data && sudo xfs_growfs /srv
  Size of logical volume tank/data changed from <7.56 TiB (1981022 extents) to 8.63 TiB (2262622 extents).
  Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data  isize=512    agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2028566528, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 2028566528 to 2316924928

Let me know when I can stop mysql on dbstore1004 to proceed and migrate the data.
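The figures in the lvextend / xfs_growfs output above can be cross-checked with a little arithmetic. This is only a sketch: the 4 MiB extent size is an assumption (the LVM default), not something the output states, although the numbers are consistent with it.

```python
# Figures copied from the lvextend / xfs_growfs output in this comment.
OLD_EXTENTS = 1981022
NEW_EXTENTS = 2262622
NEW_XFS_BLOCKS = 2316924928
XFS_BLOCK_SIZE = 4096           # bsize=4096 in the xfs_growfs output
EXTENT_SIZE = 4 * 1024**2       # ASSUMPTION: default 4 MiB LVM extent

GIB = 1024**3
TIB = 1024**4


def extents_for(size_bytes: int, extent_size: int = EXTENT_SIZE) -> int:
    """Number of whole extents needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // extent_size)


added = extents_for(1100 * GIB)            # the "-L+1100G" request
lv_bytes = NEW_EXTENTS * EXTENT_SIZE       # size of the logical volume
fs_bytes = NEW_XFS_BLOCKS * XFS_BLOCK_SIZE # size of the grown filesystem

print("extents added:", added)
print("matches reported extent count:", OLD_EXTENTS + added == NEW_EXTENTS)
print("LV size in TiB:", round(lv_bytes / TIB, 2))
print("filesystem fills the LV:", lv_bytes == fs_bytes)
```

Under that assumption, 1100 GiB is 281600 extents, which added to 1981022 gives the reported 2262622, and the grown XFS block count covers the whole 8.63 TiB volume.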

Change 693296 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Do not format /srv on dbstore1006

https://gerrit.wikimedia.org/r/693296

@Marostegui glad you were able to figure that out and that it worked on a new reimage. My last attempt timed out, and I was troubleshooting some network issues that might not have been fully resolved when I started the reimage.

As for when dbstore1004 can be stopped, I'm not sure exactly how it's being used, perhaps @Milimetric or @JAllemandou can chime in?

I'm curious if it would be possible to keep 1004 up and running while 1006 is populated - I'm guessing it's possible but more trouble since the data is transferred as a snapshot which would miss the updates happening during the transfer.

In T283125#7104376, @razzi wrote:

@Marostegui glad you were able to figure that out and that it worked on a new reimage. My last attempt timed out, and I was troubleshooting some network issues that might not have been fully resolved when I started the reimage.

As for when dbstore1004 can be stopped, I'm not sure exactly how it's being used, perhaps @Milimetric or @JAllemandou can chime in?

If it helps, we can stop the sections it contains at different days/times (it has: s2, s3 and s4).

I'm curious if it would be possible to keep 1004 up and running while 1006 is populated - I'm guessing it's possible but more trouble since the data is transferred as a snapshot which would miss the updates happening during the transfer.

We could use xtrabackup, but it adds some overhead for us, so I would prefer to stop mysql for a few hours and copy all the content.
If it is absolutely necessary to keep it up at all times we can do it via xtrabackup.

@Marostegui @razzi dbstore1004 can be stopped any time, we use it in two places:

user queries
start-of-the-month jobs

If the downtime for 1004 is days we can announce it, but IIUC it will be only to copy data so we can proceed any time.

@razzi remember to update the Analytics VLAN firewall rules for dbstore1006 when everything is finished, otherwise we'll not be able to use it from the Analytics hosts.

Change 693296 merged by Marostegui:

[operations/puppet@production] install_server: Do not format /srv on dbstore1006

https://gerrit.wikimedia.org/r/693296

Thanks @elukey - I am on clinic duty this week, so we'll see if I have time for this :(

Change 694002 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/homer/public@master] Add dbstore1006 to analytics vlan

https://gerrit.wikimedia.org/r/694002

Change 694024 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1004: Disable notifications

https://gerrit.wikimedia.org/r/694024

Mentioned in SAL (#wikimedia-operations) [2021-05-25T04:25:40Z] <marostegui> Stop MySQL on dbstore1004 to clone dbstore1006 T283125

Change 694024 merged by Marostegui:

[operations/puppet@production] dbstore1004: Disable notifications

https://gerrit.wikimedia.org/r/694024

Transfer started.

@razzi I forgot to mention that the DNS CNAME/SRV records also need to be updated, otherwise the various tools that we use will not work:

templates/wmnet:s2-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
templates/wmnet:s3-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
templates/wmnet:s4-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
templates/wmnet:_s2-analytics._tcp       5M  IN SRV      0 1 3312 dbstore1004.eqiad.wmnet.
templates/wmnet:_s3-analytics._tcp       5M  IN SRV      0 1 3313 dbstore1004.eqiad.wmnet.
templates/wmnet:_s4-analytics._tcp       5M  IN SRV      0 1 3314 dbstore1004.eqiad.wmnet.
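As an illustration of the record swap, here is a minimal sketch that retargets the six records listed above at a replacement host. The helper name `retarget` is hypothetical, the `templates/wmnet:` grep prefix has been dropped from the sample, and the replacement hostname is passed in as a parameter; the actual change was made by hand in the operations/dns repository.

```python
RECORDS = """\
s2-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
s3-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
s4-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
_s2-analytics._tcp       5M  IN SRV      0 1 3312 dbstore1004.eqiad.wmnet.
_s3-analytics._tcp       5M  IN SRV      0 1 3313 dbstore1004.eqiad.wmnet.
_s4-analytics._tcp       5M  IN SRV      0 1 3314 dbstore1004.eqiad.wmnet."""


def retarget(zone_text: str, old_fqdn: str, new_fqdn: str) -> str:
    """Point every record whose target is old_fqdn at new_fqdn instead.

    Only the last field of a line (the record target) is rewritten, so
    owner names, TTLs, priorities and ports are left untouched.
    """
    out = []
    for line in zone_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == old_fqdn + ".":
            cut = line.rstrip().rfind(fields[-1])
            line = line[:cut] + new_fqdn + "."
        out.append(line)
    return "\n".join(out)


updated = retarget(RECORDS, "dbstore1004.eqiad.wmnet", "dbstore1007.eqiad.wmnet")
print(updated)
```

Each SRV port (3312, 3313, 3314) stays with its section (s2, s3, s4); only the target host changes.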

I have had to remove the ipv6 dns due to: T270101

@razzi the data is cloned, however the host cannot reach any of the masters, I guess there are some FW/VLAN rules that need changing?
I am checking the tables for now, to make sure everything went well. Once the connectivity is fixed, replication will start automatically

Pending: Enable GTID

root@dbstore1006:/srv# telnet db1122.eqiad.wmnet 3306
Trying 10.64.48.34...
^C
root@dbstore1006:/srv# telnet db1123.eqiad.wmnet 3306
Trying 10.64.48.35...
^C
root@dbstore1006:/srv# telnet db1138.eqiad.wmnet 3306
Trying 10.64.48.124...

^C
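The telnet checks above hang until interrupted. A small scripted equivalent with a timeout, as a sketch: the function names are hypothetical, while the hostnames and port 3306 are the ones from the transcript.

```python
import socket

# Masters and port taken from the telnet transcript above.
MASTERS = ("db1122.eqiad.wmnet", "db1123.eqiad.wmnet", "db1138.eqiad.wmnet")
MYSQL_PORT = 3306


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


def unreachable(hosts=MASTERS, port: int = MYSQL_PORT) -> list:
    """Return the subset of hosts that cannot be reached on port."""
    return [h for h in hosts if not can_connect(h, port)]
```

Run on the new host, `unreachable()` would return an empty list once the firewall and VLAN rules allow traffic to the masters.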

@razzi @elukey @Ottomata I am afraid we need to re-do all this work. I just noticed that db1125 isn't our standard HW, but one of the old snowflake hosts we bought years ago. It has 256GB of RAM rather than 512GB, which is probably not enough for dbstore1006's performance.
I thought we got rid of all these hosts, but we didn't :-(
My apologies.

We'd need to take db1183 for this. We can decommission dbstore1006 and rename db1183 (with the same process we did) to dbstore1006 (or 1007, whatever is easier).
I will then take this host (dbstore1006/db1125) and use it somewhere else.

Let me know if you want to re-use this task or you prefer me to create a new one.

To sum up:

- dbstore1006 to be decommissioned (Manuel will use this host for something else)
- db1183 to be converted into a dbstore
- once that is done: dbstore1004 to be decommissioned (Manuel will take that host for something else)

I have taken care of all the stuff from our side, so db1183 is now ready to be reimaged at your convenience. Let me know if you want me to decommission dbstore1006 or you do it, so I can re-use it somewhere else.

@razzi do you have an ETA for when you will resume this work? Thanks! (host is at 87% now)

@Marostegui I'll reimage db1183 today, should be set for you to work on it tomorrow.

@razzi thanks - let me know when I can proceed

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: db1183.eqiad.wmnet

db1183.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found physical host
- Downtimed management interface on Icinga
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Change 697704 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/homer/public@master] Add dbstore1007 to analytics firewall

https://gerrit.wikimedia.org/r/697704

Change 697705 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: add dbstore1007, reimaged from db1183

https://gerrit.wikimedia.org/r/697705

Change 697705 merged by Razzi:

[operations/puppet@production] site: add dbstore1007, reimaged from db1183

https://gerrit.wikimedia.org/r/697705

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

db1183.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106020624_razzi_7617_db1183_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1183.eqiad.wmnet']

Of which those FAILED:

['db1183.eqiad.wmnet']

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

dbstore1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106020638_razzi_10920_dbstore1007_eqiad_wmnet.log.

• razzi mentioned this in T284126: Relabel db1183 to be dbstore1007. Jun 2 2021, 6:51 AM

Change 697706 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: give mariadb role to dbstore1007

https://gerrit.wikimedia.org/r/697706

Marostegui mentioned this in T284128: Re-image (rename) dbstore1006 into db1125. Jun 2 2021, 6:59 AM

Completed auto-reimage of hosts:

['dbstore1007.eqiad.wmnet']

and were ALL successful.

Thanks @razzi - could you or @elukey let me know if I can stop this host? (Given it is the start of the month, not sure if it is being used)

No go for this week :( (assuming that you want to stop dbstore100[3-5])

No worries - I will ping again on Monday next week

Change 697706 merged by Marostegui:

[operations/puppet@production] site: give mariadb role to dbstore1007

https://gerrit.wikimedia.org/r/697706

Change 697833 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Fix dbstore1007 role

https://gerrit.wikimedia.org/r/697833

Change 697833 merged by Marostegui:

[operations/puppet@production] site.pp: Fix dbstore1007 role

https://gerrit.wikimedia.org/r/697833

Change 697834 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: remove dbstore1006 from mariadb role

https://gerrit.wikimedia.org/r/697834

Change 697834 merged by Razzi:

[operations/puppet@production] site: remove dbstore1006 from mariadb role

https://gerrit.wikimedia.org/r/697834

• Kormat closed subtask T284128: Re-image (rename) dbstore1006 into db1125 as Resolved. Jun 3 2021, 12:44 PM

Mentioned in SAL (#wikimedia-operations) [2021-06-07T05:57:25Z] <marostegui> Stop dbstore1004 to clone dbstore1007 T283125

Replication positions:

P16309 (An Untitled Masterwork)

root@dbstore1004:/srv# for i in 2 3 4; do echo "====== $i ======"; mysql -S /run/mysqld/mysqld.s$i.sock -e "show slave status\G" ; done
====== 2 ======
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1122.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1122-bin.002981
Read_Master_Log_Pos: 987954888
Relay_Log_File: dbstore1004-relay-bin.000072
Relay_Log_Pos: 987955184
Relay_Master_Log_File: db1122-bin.002981
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 987954888
Relay_Log_Space: 987955539
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978786
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171970567-171970567-390719906,0-180359173-4858865027,180359173-180359173-70817914,171978786-171978786-3193273639,180359271-180359271-332498589,171966574-171966574-2221092918,180359241-180359241-121693516,171966670-171966670-2410812544
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 63183171
====== 3 ======
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1123.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1123-bin.004324
Read_Master_Log_Pos: 466847508
Relay_Log_File: dbstore1004-relay-bin.000106
Relay_Log_Pos: 466847804
Relay_Master_Log_File: db1123-bin.004324
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 466847508
Relay_Log_Space: 466848159
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978787
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171978787-171978787-3089023103,0-171966669-4075108480,171966669-171966669-4196523483,180359174-180359174-94123433,180363367-180363367-134174373,171974792-171974792-378345284,180355192-180355192-321029096
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 71748739
====== 4 ======
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1138.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1138-bin.004144
Read_Master_Log_Pos: 173967607
Relay_Log_File: dbstore1004-relay-bin.000124
Relay_Log_Pos: 173967903
Relay_Master_Log_File: db1138-bin.004144
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 173967607
Relay_Log_Space: 173968258
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978876
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-180359175-3368394787,171978876-171978876-2925644422,171966557-171966557-1948492266,180363436-180363436-1155411339,171978775-171978775-4822899280,180359175-180359175-43143523,171970589-171970589-201132050,180359190-180359190-192195477
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 94416129
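The coordinates that matter when resuming replication on a clone of a stopped replica are `Relay_Master_Log_File` and `Exec_Master_Log_Pos`. The sketch below pulls them out of `show slave status\G` text in its usual one-field-per-line form and builds a `CHANGE MASTER TO` statement from them. The function names are hypothetical, the sample is trimmed to a few s2 fields from the paste, and this is not necessarily how replication was configured on the new host.

```python
SAMPLE = """\
*************************** 1. row ***************************
Master_Host: db1122.eqiad.wmnet
Master_Port: 3306
Relay_Master_Log_File: db1122-bin.002981
Exec_Master_Log_Pos: 987954888
Using_Gtid: Slave_Pos
"""


def parse_slave_status(text: str) -> dict:
    """Parse 'Key: value' lines from SHOW SLAVE STATUS\\G output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Skip separators such as the '1. row' banner, which have no key.
        if sep and key and " " not in key:
            fields[key] = value.strip()
    return fields


def change_master_sql(status: dict) -> str:
    """Build a binlog-position CHANGE MASTER TO statement.

    Uses the position the SQL thread had executed up to, i.e.
    Relay_Master_Log_File / Exec_Master_Log_Pos.
    """
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{status['Master_Host']}', "
        f"MASTER_PORT={int(status['Master_Port'])}, "
        f"MASTER_LOG_FILE='{status['Relay_Master_Log_File']}', "
        f"MASTER_LOG_POS={int(status['Exec_Master_Log_Pos'])};"
    )


status = parse_slave_status(SAMPLE)
print(change_master_sql(status))
```

In the paste, `Read_Master_Log_Pos` and `Exec_Master_Log_Pos` are equal for all three sections, so each instance had applied everything it had fetched before it was stopped.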

root@dbstore1007:/srv# sudo lvextend -L+1100G /dev/mapper/tank-data && sudo xfs_growfs /srv
  Size of logical volume tank/data changed from <7.56 TiB (1981022 extents) to 8.63 TiB (2262622 extents).
  Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data  isize=512    agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2028566528, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 2028566528 to 2316924928
root@dbstore1007:/srv# df -hT
Filesystem            Type      Size  Used Avail Use% Mounted on
udev                  devtmpfs  252G     0  252G   0% /dev
tmpfs                 tmpfs      51G   90M   51G   1% /run
/dev/sda1             ext4       37G  3.4G   32G  10% /
tmpfs                 tmpfs     252G     0  252G   0% /dev/shm
tmpfs                 tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                 tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data xfs       8.7T   15G  8.7T   1% /srv
tmpfs                 tmpfs      51G     0   51G   0% /run/user/15343
tmpfs                 tmpfs      51G     0   51G   0% /run/user/0
root@dbstore1007:/srv# pvs
  PV         VG   Fmt  Attr PSize  PFree
  /dev/sda3  tank lvm2 a--  <8.69t 56.30g

Transfer between dbstore1004 and dbstore1007 started

@razzi dbstore1007 still needs to get the proper FW rules (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/697704), as it cannot reach the masters yet:

root@dbstore1007:~# telnet db1122.eqiad.wmnet 3306
Trying 10.64.48.34...

This host needs ipv6 dns to be deleted from netbox

In T283125#7138250, @Marostegui wrote:

This host needs ipv6 dns to be deleted from netbox

Done

Host added to tendril and zarcillo.
Set to active on Netbox

@razzi transfer has finished and I have configured replication. As soon as you push the new firewall rules, it will start catching up automatically.

Change 697704 merged by jenkins-bot:

[operations/homer/public@master] Add dbstore1007 to analytics firewall

https://gerrit.wikimedia.org/r/697704

• razzi moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board. Jun 7 2021, 2:48 PM

jenkins-bot mentioned this in rOHPU0dc241c34b92: Add dbstore1007 to analytics firewall. Jun 7 2021, 2:49 PM

@Marostegui new firewall rules are pushed, thanks for the update on your end.

As mentioned on IRC, there might be something else needed as it cannot reach its master yet:

root@dbstore1007:~# telnet db1122.eqiad.wmnet 3306
Trying 10.64.48.34...
^C

@razzi some more things to double check:

[17:57:10]  <@marostegui> Also, is dbstore1007 in the same vlan as dbstore1004?
[17:57:56]  <volans> analytics1-d-eqiad
[17:57:58]  <volans> https://netbox.wikimedia.org/dcim/interfaces/18116/
[17:58:36]  <volans> while dbstore1004 is in private1-b-eqiad
[17:58:36]  <volans> https://netbox.wikimedia.org/dcim/interfaces/16412/
[17:58:45]  <@marostegui> so I guess that's an issue too
[17:58:46]  <volans> so no :)
[17:58:54]  <volans> according to netbox

I am not sure if it is going to be a nightmare or not, but to avoid wiping the copy that Manuel did between 1004 and 1007 today with a reimage we could:

- ifdown the interface on dbstore1007 manually (+ downtime etc.)
- change the VLAN on Netbox + run the cookbook to get the new IP(s)
- run homer to fix the VLAN tagging on the port
- connect to mgmt, set the network interfaces config under /etc manually, then ifup

We can reimage without wiping /srv if needed too

The reimage is surely good, but I think that we'd need to fix the ips manually anyway in netbox first. @Volans do you have suggestions about what's best? :)

@elukey what do you need to change, just the vlan hence the IP? Ping me tomorrow and we can do it together.

Change 698656 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Do not format /srv on dbstore1007

https://gerrit.wikimedia.org/r/698656

Change 698656 merged by Marostegui:

[operations/puppet@production] install_server: Do not format /srv on dbstore1007

https://gerrit.wikimedia.org/r/698656

@elukey @razzi I have merged the above patch, which allows dbstore1007 to be reimaged without formatting its /srv.
It needs to have their mysql instances stopped before the reimage though.

I have actually stopped mysql, as they are not replicating anyways. So we can reimage this host anytime.

Manual changes to Netbox have been done by me and @elukey, namely:

- deleted the IPv4 and IPv6 from dbstore1007
- selected the VLAN for private1-d-eqiad, opened both IPv4 and IPv6 prefixes, assigned a new IPv4 with dns name and marked as primary, assigned a new IPv6 with the mapped IPv4 without dns and marked as primary
- updated the port VLAN assignment on the switch side

All changes can be seen in the Netbox changelog
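The "mapped IPv4" step is not spelled out above. One reading, which is an assumption here, is that the IPv4 octets are written verbatim as the last four groups of an IPv6 address inside the VLAN's /64. The sketch below shows that construction; the function name is hypothetical and the prefix is taken from the 2001:db8::/32 documentation range purely as a placeholder, not the real private1-d-eqiad prefix.

```python
import ipaddress


def mapped_ipv6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Embed the decimal octets of ipv4 as the last four groups of a /64.

    ASSUMED convention: 10.64.48.34 inside <prefix>::/64 becomes
    <prefix>:10:64:48:34 (each octet is reused as-is as a hex group).
    """
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    ipaddress.IPv4Address(ipv4)  # raises ValueError on a malformed address
    host = 0
    for octet in ipv4.split("."):
        # Reinterpret the decimal digits as hex so they print unchanged.
        host = (host << 16) | int(octet, 16)
    return net.network_address + host


# Placeholder prefix from the IPv6 documentation range.
print(mapped_ipv6("2001:db8:0:1::/64", "10.64.48.34"))
```

The appeal of such a scheme is that the IPv4 address stays readable inside the IPv6 one, which makes the pair easy to match by eye in Netbox.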

@Marostegui the new IPs seem to work as expected. If you want to kick off a reimage please go ahead, but in theory it should run fine from now on (I just verified a telnet to db1122 and it worked). Let me know if it looks ok now or not!

Change 698729 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones

https://gerrit.wikimedia.org/r/698729

In T283125#7141412, @elukey wrote:

@Marostegui the new IPs seem to work as expected. If you want to kick off a reimage please go ahead, but in theory it should run fine from now on (I just verified a telnet to db1122 and it worked). Let me know if it looks ok now or not!

It looks good!!! Do you still think we need the reimage? If we can reach the masters, from my side there's no reimage needed. Anything from your side requiring it?

I don't think it is needed, we can proceed with what we have :)

MySQL started and catching up! Thank you all!! (including @Volans!)
I expect it to be in sync with the master by tomorrow - once done, I will enable GTID.

Once it has caught up, is there anything else needed from your side @razzi @elukey to make this host "active", so that we can proceed with stopping and repurposing dbstore1004?

dbstore1007: s2, s3 and s4 are now up-to-date.
GTID is in place.

@razzi @elukey anything else to be done from our side before we can stop and recycle dbstore1004?

Change 698794 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1007: Enable notifications

https://gerrit.wikimedia.org/r/698794

Change 698794 merged by Marostegui:

[operations/puppet@production] dbstore1007: Enable notifications

https://gerrit.wikimedia.org/r/698794

Yep we need to merge https://gerrit.wikimedia.org/r/698729 and verify that everything works as expected :)

Excellent - let me know when we can proceed with the future plans for dbstore1004.
Feel free to close this task once you are done from your side. I will create a different task to repurpose dbstore1004 once this one is closed.

Change 698729 merged by Elukey:

[operations/dns@master] Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones

https://gerrit.wikimedia.org/r/698729

Change 698824 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Fix the dbstore1007 IP after changing VLAN in analytics-in4

https://gerrit.wikimedia.org/r/698824

Change 698824 merged by Elukey:

[operations/homer/public@master] Fix the dbstore1007 IP after changing VLAN in analytics-in4

https://gerrit.wikimedia.org/r/698824

elukey mentioned this in rOHPU69fe6c1e820a: Fix the dbstore1007 IP after changing VLAN in analytics-in4. Jun 8 2021, 5:03 PM