
dbstore1004 85% disk space used.
Closed, Resolved (Public)

Description

dbstore1004 is the analytics host that holds s2, s3 and s4.
It is now using 85% of its disk space, mainly because of s4 (which is already compressed).
There's not much else we can do there apart from replacing this host.

My understanding is that Analytics didn't budget for having these hosts (dbstore*) expanded, so this requires some reorganization of our databases to find a host to replace this one with.

We can do the data transfer and db reorganization, but Analytics would need to handle everything else.

Event Timeline


Change 693224 merged by Razzi:

[operations/puppet@production] netboot: Change dbstore1006 netboot.cfg to partman/custom/db.cfg

https://gerrit.wikimedia.org/r/693224

Change 693230 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: add role for dbstore1006

https://gerrit.wikimedia.org/r/693230

Change 693230 merged by Razzi:

[operations/puppet@production] site: add role for dbstore1006

https://gerrit.wikimedia.org/r/693230

Hmm, the machine has been renamed and is almost operational, but doesn't have ssh keys so I can't log in. I'm not sure what to do at this point, I tried adding the mariadb::dbstore_multiinstance role in case the problem was that it didn't have a role, but that didn't do anything. If you know what to do from here @Marostegui feel free to go for it; otherwise I'll ask around SRE tomorrow.

@razzi I have fixed the issue and the host is now accessible. The certificate wasn't signed by puppet, so I have signed it manually. Did you get any error during the installation?
However, I would prefer that you attempt the full reimage again, to make sure the host reimages cleanly without any manual workaround; otherwise there might be something wrong somewhere that we would hit again in future reimages.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['dbstore1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105210505_marostegui_18888.log.

Completed auto-reimage of hosts:

['dbstore1006.eqiad.wmnet']

and were ALL successful.

@razzi all went fine this time, do you remember if you used the --new thing? Maybe that was the issue.
Anyways, I have executed:

root@dbstore1006:~# sudo lvextend -L+1100G /dev/mapper/tank-data && sudo xfs_growfs /srv
  Size of logical volume tank/data changed from <7.56 TiB (1981022 extents) to 8.63 TiB (2262622 extents).
  Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data  isize=512    agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2028566528, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 2028566528 to 2316924928
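
The figures in the output above can be cross-checked against each other. The following is a minimal sketch of that arithmetic, assuming 4 MiB LVM extents (the LVM default; the output does not state the extent size, it is only consistent with the reported sizes) and the 4096-byte XFS block size shown as `bsize`. The helper names are illustrative.

```python
# Sanity-check the lvextend / xfs_growfs figures reported above.
# Assumption: 4 MiB physical extents (LVM default, not shown in the output).
EXTENT_BYTES = 4 * 1024**2
XFS_BLOCK_BYTES = 4096  # "bsize=4096" in the xfs_growfs output
GIB = 1024**3
TIB = 1024**4


def extents_to_tib(extents: int) -> float:
    """Convert a count of LVM extents to TiB."""
    return extents * EXTENT_BYTES / TIB


def grown_gib(before: int, after: int, unit_bytes: int) -> float:
    """Growth between two unit counts, expressed in GiB."""
    return (after - before) * unit_bytes / GIB


# Extent counts from the lvextend line, block counts from xfs_growfs.
lv_growth = grown_gib(1981022, 2262622, EXTENT_BYTES)
fs_growth = grown_gib(2028566528, 2316924928, XFS_BLOCK_BYTES)

print(f"LV before: {extents_to_tib(1981022):.2f} TiB")
print(f"LV after:  {extents_to_tib(2262622):.2f} TiB")
print(f"LV grew by {lv_growth:.0f} GiB, filesystem grew by {fs_growth:.0f} GiB")
```

Under those assumptions both the logical volume and the filesystem grow by the 1100 GiB requested with `-L+1100G`.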

Let me know when I can stop mysql on dbstore1004 to proceed and migrate the data.

Change 693296 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Do not format /srv on dbstore1006

https://gerrit.wikimedia.org/r/693296

@Marostegui glad you were able to figure that out and that it worked on a new reimage. My last attempt timed out, and I was troubleshooting some network issues that might not have been fully resolved when I started the reimage.

As for when dbstore1004 can be stopped, I'm not sure exactly how it's being used, perhaps @Milimetric or @JAllemandou can chime in?

I'm curious if it would be possible to keep 1004 up and running while 1006 is populated - I'm guessing it's possible but more trouble since the data is transferred as a snapshot which would miss the updates happening during the transfer.

As for when dbstore1004 can be stopped, I'm not sure exactly how it's being used, perhaps @Milimetric or @JAllemandou can chime in?

If it helps, we can stop the sections it contains at different days/times (it has: s2, s3 and s4).

I'm curious if it would be possible to keep 1004 up and running while 1006 is populated - I'm guessing it's possible but more trouble since the data is transferred as a snapshot which would miss the updates happening during the transfer.

We could use xtrabackup, but it means more overhead for us, so I would prefer to stop mysql for a few hours and copy all the content.
If it is absolutely necessary to keep it up at all times, we can do it via xtrabackup.

@Marostegui @razzi dbstore1004 can be stopped any time, we use it in two places:

  1. user queries
  2. start-of-the-month jobs

If the downtime for 1004 is days we can announce it, but IIUC it will be only to copy data so we can proceed any time.

@razzi remember to update the Analytics VLAN firewall rules for dbstore1006 when everything is finished, otherwise we'll not be able to use it from the Analytics hosts.

Change 693296 merged by Marostegui:

[operations/puppet@production] install_server: Do not format /srv on dbstore1006

https://gerrit.wikimedia.org/r/693296

Thanks @elukey - I am on clinic duty this week, so we'll see if I have time for this :(

Change 694002 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/homer/public@master] Add dbstore1006 to analytics vlan

https://gerrit.wikimedia.org/r/694002

Change 694024 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1004: Disable notifications

https://gerrit.wikimedia.org/r/694024

Mentioned in SAL (#wikimedia-operations) [2021-05-25T04:25:40Z] <marostegui> Stop MySQL on dbstore1004 to clone dbstore1006 T283125

Change 694024 merged by Marostegui:

[operations/puppet@production] dbstore1004: Disable notifications

https://gerrit.wikimedia.org/r/694024

@razzi I forgot to mention that the DNS CNAME/SRV records also need to be updated, otherwise the various tools that we use will not work:

templates/wmnet:s2-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
templates/wmnet:s3-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
templates/wmnet:s4-analytics-replica  5M  IN CNAME    dbstore1004.eqiad.wmnet.
templates/wmnet:_s2-analytics._tcp       5M  IN SRV      0 1 3312 dbstore1004.eqiad.wmnet.
templates/wmnet:_s3-analytics._tcp       5M  IN SRV      0 1 3313 dbstore1004.eqiad.wmnet.
templates/wmnet:_s4-analytics._tcp       5M  IN SRV      0 1 3314 dbstore1004.eqiad.wmnet.
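
The SRV lines above carry the per-section port (s2 on 3312, s3 on 3313, s4 on 3314), so only the target host has to change. The following is a rough sketch of parsing those lines and producing retargeted records. The function names `parse_srv` and `retarget` are illustrative and not part of any Wikimedia tooling; the new host name is a parameter, and dbstore1007 is used in the example because that is the host the records were eventually swapped to in change 698729.

```python
# Parse the SRV lines quoted above and retarget them at a new host.
# Field order after "IN SRV" is: priority weight port target (RFC 2782).
from typing import NamedTuple


class SrvRecord(NamedTuple):
    name: str
    ttl: str
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(line: str) -> SrvRecord:
    # Drop the "templates/wmnet:" grep prefix, then split on whitespace.
    fields = line.split(":", 1)[1].split()
    name, ttl, _cls, rtype, prio, weight, port, target = fields
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {line!r}")
    return SrvRecord(name, ttl, int(prio), int(weight), int(port), target)


def retarget(record: SrvRecord, new_host: str) -> str:
    # Keep name, TTL, priority, weight and port; change only the target.
    updated = record._replace(target=new_host)
    return (f"{updated.name}  {updated.ttl}  IN SRV  "
            f"{updated.priority} {updated.weight} {updated.port} {updated.target}")


lines = [
    "templates/wmnet:_s2-analytics._tcp       5M  IN SRV      0 1 3312 dbstore1004.eqiad.wmnet.",
    "templates/wmnet:_s3-analytics._tcp       5M  IN SRV      0 1 3313 dbstore1004.eqiad.wmnet.",
    "templates/wmnet:_s4-analytics._tcp       5M  IN SRV      0 1 3314 dbstore1004.eqiad.wmnet.",
]
records = [parse_srv(line) for line in lines]
ports = {r.name: r.port for r in records}

for record in records:
    print(retarget(record, "dbstore1007.eqiad.wmnet."))
```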

I have had to remove the ipv6 dns due to: T270101

@razzi the data is cloned, however the host cannot reach any of the masters, I guess there are some FW/VLAN rules that need changing?
I am checking the tables for now, to make sure everything went well. Once the connectivity is fixed, replication will start automatically

Pending: Enable GTID

root@dbstore1006:/srv# telnet db1122.eqiad.wmnet 3306
Trying 10.64.48.34...
^C
root@dbstore1006:/srv# telnet db1123.eqiad.wmnet 3306
Trying 10.64.48.35...
^C
root@dbstore1006:/srv# telnet db1138.eqiad.wmnet 3306
Trying 10.64.48.124...

^C
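
The manual telnet probes above can also be scripted so that they time out on their own instead of needing Ctrl-C. This is a minimal sketch using only the Python standard library; the function names and the 5-second timeout are assumptions, and nothing here is an existing Wikimedia check.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False


def check_masters(masters, port=3306, timeout=5.0):
    """Map each master host name to whether its MySQL port is reachable."""
    return {host: can_connect(host, port, timeout) for host in masters}


# Example call (not executed here), using the three masters probed above:
# check_masters(["db1122.eqiad.wmnet", "db1123.eqiad.wmnet", "db1138.eqiad.wmnet"])
```

A result of `False` for every master would match what the telnet sessions show: the connection attempt hangs until interrupted, which points at filtering rather than at MySQL itself.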

@razzi @elukey @Ottomata I am afraid we need to re-do all this work. I just noticed that db1125 isn't the standard HW we have, but one of the old snowflake hosts we bought years ago. It doesn't have 512GB RAM, but 256GB, which is probably not enough for dbstore1006 to perform well.
I thought we had got rid of all these hosts, but we hadn't :-(
My apologies.

We'd need to take db1183 for this. We can decommission dbstore1006 and rename db1183 (with the same process we did) to dbstore1006 (or 1007, whatever is easier).
I will then take this host (dbstore1006/db1125) and use it somewhere else.

Let me know if you want to re-use this task or you prefer me to create a new one.

To sum up:

  • dbstore1006 to be decommissioned (manuel will use this host for something else)
  • db1183 to be converted into a dbstore
    • once that is done: dbstore1004 to be decommissioned (manuel will take that host for something else).

I have taken care of all the stuff from our side, so db1183 is now ready to be reimaged at your convenience. Let me know if you want me to decommission dbstore1006 or you do it, so I can re-use it somewhere else.

@razzi do you have an ETA for when you will resume this work? Thanks! (host is at 87% now)

@Marostegui I'll reimage db1183 today, should be set for you to work on it tomorrow.

@razzi thanks - let me know when I can proceed

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: db1183.eqiad.wmnet

  • db1183.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 697704 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/homer/public@master] Add dbstore1007 to analytics firewall

https://gerrit.wikimedia.org/r/697704

Change 697705 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: add dbstore1007, reimaged from db1183

https://gerrit.wikimedia.org/r/697705

Change 697705 merged by Razzi:

[operations/puppet@production] site: add dbstore1007, reimaged from db1183

https://gerrit.wikimedia.org/r/697705

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

db1183.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106020624_razzi_7617_db1183_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1183.eqiad.wmnet']

Of which those FAILED:

['db1183.eqiad.wmnet']

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

dbstore1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106020638_razzi_10920_dbstore1007_eqiad_wmnet.log.

Change 697706 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: give mariadb role to dbstore1007

https://gerrit.wikimedia.org/r/697706

Completed auto-reimage of hosts:

['dbstore1007.eqiad.wmnet']

and were ALL successful.

Thanks @razzi - could you or @elukey let me know if I can stop this host? (Given it is the start of the month, not sure if it is being used)

No go for this week :( (assuming that you want to stop dbstore100[3-5])

No worries - I will ping again on Monday next week

Change 697706 merged by Marostegui:

[operations/puppet@production] site: give mariadb role to dbstore1007

https://gerrit.wikimedia.org/r/697706

Change 697833 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Fix dbstore1007 role

https://gerrit.wikimedia.org/r/697833

Change 697833 merged by Marostegui:

[operations/puppet@production] site.pp: Fix dbstore1007 role

https://gerrit.wikimedia.org/r/697833

Change 697834 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] site: remove dbstore1006 from mariadb role

https://gerrit.wikimedia.org/r/697834

Change 697834 merged by Razzi:

[operations/puppet@production] site: remove dbstore1006 from mariadb role

https://gerrit.wikimedia.org/r/697834

Mentioned in SAL (#wikimedia-operations) [2021-06-07T05:57:25Z] <marostegui> Stop dbstore1004 to clone dbstore1007 T283125

Replication positions:

root@dbstore1004:/srv# for i in 2 3 4; do echo "====== $i ======"; mysql -S /run/mysqld/mysqld.s$i.sock -e "show slave status\G" ; done
====== 2 ======
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1122.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1122-bin.002981
Read_Master_Log_Pos: 987954888
Relay_Log_File: dbstore1004-relay-bin.000072
Relay_Log_Pos: 987955184
Relay_Master_Log_File: db1122-bin.002981
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 987954888
Relay_Log_Space: 987955539
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978786
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171970567-171970567-390719906,0-180359173-4858865027,180359173-180359173-70817914,171978786-171978786-3193273639,180359271-180359271-332498589,171966574-171966574-2221092918,180359241-180359241-121693516,171966670-171966670-2410812544
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 63183171
====== 3 ======
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1123.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1123-bin.004324
Read_Master_Log_Pos: 466847508
Relay_Log_File: dbstore1004-relay-bin.000106
Relay_Log_Pos: 466847804
Relay_Master_Log_File: db1123-bin.004324
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 466847508
Relay_Log_Space: 466848159
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978787
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171978787-171978787-3089023103,0-171966669-4075108480,171966669-171966669-4196523483,180359174-180359174-94123433,180363367-180363367-134174373,171974792-171974792-378345284,180355192-180355192-321029096
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 71748739
====== 4 ======
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1138.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1138-bin.004144
Read_Master_Log_Pos: 173967607
Relay_Log_File: dbstore1004-relay-bin.000124
Relay_Log_Pos: 173967903
Relay_Master_Log_File: db1138-bin.004144
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 173967607
Relay_Log_Space: 173968258
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978876
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-180359175-3368394787,171978876-171978876-2925644422,171966557-171966557-1948492266,180363436-180363436-1155411339,171978775-171978775-4822899280,180359175-180359175-43143523,171970589-171970589-201132050,180359190-180359190-192195477
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 94416129
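
When recording positions before a clone, one useful check is that the stopped replica had applied everything it had read: `Read_Master_Log_Pos` equals `Exec_Master_Log_Pos`, and `Master_Log_File` equals `Relay_Master_Log_File`. That holds for all three sections above. The following is a minimal sketch of automating that check on pasted output. The helper names are illustrative, and the sample is an abridged copy of the s2 values, with the GTID list shortened to its first two entries.

```python
def parse_slave_status(text: str) -> dict:
    """Turn the 'Key: value' lines of vertical SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # separator rows ("=====", "*****") contain no colon
            status[key.strip()] = value.strip()
    return status


def fully_applied(status: dict) -> bool:
    """True when every event read from the master has also been executed."""
    return (status["Master_Log_File"] == status["Relay_Master_Log_File"]
            and status["Read_Master_Log_Pos"] == status["Exec_Master_Log_Pos"])


def parse_gtid_list(value: str) -> list:
    """Split a MariaDB GTID list into (domain_id, server_id, sequence) tuples."""
    return [tuple(int(part) for part in gtid.split("-"))
            for gtid in value.split(",") if gtid]


# Abridged s2 values copied from the output above.
sample = """\
Master_Host: db1122.eqiad.wmnet
Master_Log_File: db1122-bin.002981
Read_Master_Log_Pos: 987954888
Relay_Master_Log_File: db1122-bin.002981
Exec_Master_Log_Pos: 987954888
Gtid_IO_Pos: 171970567-171970567-390719906,0-180359173-4858865027
"""

status = parse_slave_status(sample)
print(status["Master_Host"], "fully applied:", fully_applied(status))
print(parse_gtid_list(status["Gtid_IO_Pos"]))
```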

root@dbstore1007:/srv# sudo lvextend -L+1100G /dev/mapper/tank-data && sudo xfs_growfs /srv
  Size of logical volume tank/data changed from <7.56 TiB (1981022 extents) to 8.63 TiB (2262622 extents).
  Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data  isize=512    agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2028566528, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 2028566528 to 2316924928
root@dbstore1007:/srv# df -hT
Filesystem            Type      Size  Used Avail Use% Mounted on
udev                  devtmpfs  252G     0  252G   0% /dev
tmpfs                 tmpfs      51G   90M   51G   1% /run
/dev/sda1             ext4       37G  3.4G   32G  10% /
tmpfs                 tmpfs     252G     0  252G   0% /dev/shm
tmpfs                 tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                 tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data xfs       8.7T   15G  8.7T   1% /srv
tmpfs                 tmpfs      51G     0   51G   0% /run/user/15343
tmpfs                 tmpfs      51G     0   51G   0% /run/user/0
root@dbstore1007:/srv# pvs
  PV         VG   Fmt  Attr PSize  PFree
  /dev/sda3  tank lvm2 a--  <8.69t 56.30g

Transfer between dbstore1004 and dbstore1007 started

@razzi dbstore1007 still needs to get the proper FW rules (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/697704), as it cannot reach the masters yet:

root@dbstore1007:~# telnet db1122.eqiad.wmnet 3306
Trying 10.64.48.34...

This host needs ipv6 dns to be deleted from netbox

Done

Host added to tendril and zarcillo.
Set to active on Netbox

@razzi transfer has finished and I have configured replication. As soon as you push the new firewall rules, it will start catching up automatically.

Change 697704 merged by jenkins-bot:

[operations/homer/public@master] Add dbstore1007 to analytics firewall

https://gerrit.wikimedia.org/r/697704

@Marostegui new firewall rules are pushed, thanks for the update on your end.

As mentioned on IRC, there might be something else needed as it cannot reach its master yet:

root@dbstore1007:~# telnet db1122.eqiad.wmnet 3306
Trying 10.64.48.34...
^C

@razzi some more things to double check:

[17:57:10]  <@marostegui> Also, is dbstore1007 in the same vlan as dbstore1004?
[17:57:56]  <volans> analytics1-d-eqiad
[17:57:58]  <volans> https://netbox.wikimedia.org/dcim/interfaces/18116/
[17:58:36]  <volans> while dbstore1004 is in private1-b-eqiad
[17:58:36]  <volans> https://netbox.wikimedia.org/dcim/interfaces/16412/
[17:58:45]  <@marostegui> so I guess that's an issue too
[17:58:46]  <volans> so no :)
[17:58:54]  <volans> according to netbox

I am not sure if it is going to be a nightmare or not, but to avoid wiping the copy that Manuel did between 1004 and 1007 today with a reimage we could:

  1. ifdown interface on dbstore1007 manually (+ downtime etc..)
  2. change VLAN on Netbox + cookbook to get the new IP(s)
  3. run homer to fix the VLAN tagging on the port
  4. connect to mgmt and set etc's network interfaces manually + ifup

We can reimage without wiping /srv if needed too

The reimage is surely good, but I think that we'd need to fix the ips manually anyway in netbox first. @Volans do you have suggestions about what's best? :)

@elukey what do you need to change, just the vlan hence the IP? Ping me tomorrow and we can do it together.

Change 698656 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Do not format /srv on dbstore1007

https://gerrit.wikimedia.org/r/698656

Change 698656 merged by Marostegui:

[operations/puppet@production] install_server: Do not format /srv on dbstore1007

https://gerrit.wikimedia.org/r/698656

@elukey @razzi I have merged the above patch, which allows dbstore1007 to be reimaged without formatting its /srv.
Its mysql instances need to be stopped before the reimage, though.

I have actually stopped mysql, as they are not replicating anyways. So we can reimage this host anytime.

Manual changes to Netbox have been done by me and @elukey, namely:

  • deleted the IPv4 and IPv6 from dbstore1007
  • selected the VLAN for private1-d-eqiad, opened both IPv4 and IPv6 prefixes, assigned a new IPv4 with dns name and marked as primary, assigned a new IPv6 with the mapped IPv4 without dns and marked as primary
  • updated the port vlan assignation on the switch side

All changes can be seen in the Netbox changelog

@Marostegui the new IPs seem to work as expected. If you want to kick off a reimage please go ahead, but in theory it should run fine from now on (I just verified a telnet to db1122 and it worked). Let me know if it looks ok now or not!

Change 698729 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones

https://gerrit.wikimedia.org/r/698729

@Marostegui the new IPs seem to work as expected. If you want to kick off a reimage please go ahead, but in theory it should run fine from now on (I just verified a telnet to db1122 and it worked). Let me know if it looks ok now or not!

It looks good!!! Do you still think we need the reimage? If we can reach the masters, from my side there's no reimage needed. Anything from your side requiring it?

I don't think it is needed, we can proceed with what we have :)

MySQL started and catching up! Thank you all!! (including @Volans!)
I expect it to be in sync with the master by tomorrow - once done, I will enable GTID.

Once it has caught up, is there anything else needed from your side @razzi @elukey to make this host "active", so that we can proceed with stopping and repurposing dbstore1004?

dbstore1007: s2, s3 and s4 are now up to date.
GTID is in place.

@razzi @elukey anything else to be done from our side before we can stop and recycle dbstore1004?

Change 698794 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1007: Enable notifications

https://gerrit.wikimedia.org/r/698794

Change 698794 merged by Marostegui:

[operations/puppet@production] dbstore1007: Enable notifications

https://gerrit.wikimedia.org/r/698794

Yep we need to merge https://gerrit.wikimedia.org/r/698729 and verify that everything works as expected :)

Excellent - let me know when we can proceed with the future plans for dbstore1004.
Feel free to close this task once you are done from your side. I will create a different task to repurpose dbstore1004 once this one is closed.

Change 698729 merged by Elukey:

[operations/dns@master] Swap CNAME/SRV records for dbstore1004 with dbstore1007's ones

https://gerrit.wikimedia.org/r/698729

Change 698824 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Fix the dbstore1007 IP after changing VLAN in analytics-in4

https://gerrit.wikimedia.org/r/698824

Change 698824 merged by Elukey:

[operations/homer/public@master] Fix the dbstore1007 IP after changing VLAN in analytics-in4

https://gerrit.wikimedia.org/r/698824

@Marostegui we're ready to migrate over, so I'll mark this as done on our end and close it. Thanks for your help!

Change 694002 abandoned by Razzi:

[operations/homer/public@master] Add dbstore1006 to analytics vlan

Reason:

Not needed

https://gerrit.wikimedia.org/r/694002