Move masters away from codfw C6
Closed, Resolved, Public

Description

The following masters are all in C6 and should be moved to different racks/rows.
Proposed destination racks:

  • db2048 -> A1
  • db2035 -> B1
  • db2039 -> D1
  • db2040 -> A3
  • db2045 -> B3
  • db2042 (misc m3) -> D3

@Papaul can you confirm those destination racks can get those servers?
Confirmed by @Papaul

Event Timeline

Let's go for db2035 if that works for you!

Thanks! I will post here as soon as the server is off

new switch port information
asw-b1-codfw ge-1/0/15

Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:28:14Z] <marostegui> Stop MySQL and power off db2035 (s2 codfw master - this will stop replication on s2 codfw slaves) for rack change - T191193

Change 423484 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2035 IP

https://gerrit.wikimedia.org/r/423484

Change 423485 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add db2035 to private1-b-codfw was in private1-c-codfw

https://gerrit.wikimedia.org/r/423485

Change 423484 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2035 IP

https://gerrit.wikimedia.org/r/423484

Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:40:49Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db2035 IP - T191193 (duration: 01m 15s)

Change 423485 merged by Marostegui:
[operations/dns@master] DNS: Add db2035 to private1-b-codfw was in private1-c-codfw

https://gerrit.wikimedia.org/r/423485

old switch information
asw-c6-codfw ge-6/0/2

Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:42:09Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db2035 IP - T191193 (duration: 01m 15s)

mediawiki config files changed
network/interfaces changed
dns merged and deployed

@RobH if switch configuration is not done yet can you please change it from

new switch port information
asw-b1-codfw ge-1/0/15

to
new switch port information
asw-b1-codfw ge-1/0/4

thanks

db2035 was on asw-c6-codfw ge-6/0/2 and now will be on asw-b1-codfw ge-1/0/4

moved db2035 in racktables from C6 to B1

db2035's mysql is back and slaves are reconnecting.
I would suggest next server to be db2039.

switch port information when ready to move db2039. This is just a note for when we are ready to do the move.

db2039 was on asw-c6-codfw ge-6/0/6 and now will be on asw-d1-codfw ge-1/0/14

new ip address will be :
10.192.48.114

Mentioned in SAL (#wikimedia-operations) [2018-04-03T05:18:04Z] <marostegui> Enable back gtid on db2035 - T191193

Thanks @Papaul - let me know a day that works for you! cc @RobH

@RobH can you let us know when the switch is ready so we can move db2039?
Thanks!

I've gone ahead and enabled asw-d1-codfw ge-1/0/14, and left asw-c6-codfw ge-6/0/6 online for now.

Once the system is fully moved, we'll remove the port info from the old port.

Change 424329 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove db2039 from private1-c-codfw and place it in private1-d-codfw

https://gerrit.wikimedia.org/r/424329

Change 424335 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2039 IP

https://gerrit.wikimedia.org/r/424335

Change 424329 merged by Marostegui:
[operations/dns@master] DNS: Remove db2039 from private1-c-codfw and place it in private1-d-codfw

https://gerrit.wikimedia.org/r/424329

Change 424335 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2039 IP

https://gerrit.wikimedia.org/r/424335

Mentioned in SAL (#wikimedia-operations) [2018-04-05T15:34:12Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db2039 IP as it is being moved to a different rack - T191193 (duration: 01m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-04-05T15:35:38Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db2039 IP as it is being moved to a different rack - T191193 (duration: 01m 17s)

Racktables update. moved db2039 from C6 to D1

Please update the task with the next server to move so I can get the rack ready. Thanks

let's go for db2040 as next host
Thanks!

switch port information when ready to move db2040.

db2040 was on asw-c6-codfw ge-6/0/7 and now will be on asw-a3-codfw ge-3/0/27

new ip address will be :
10.192.0.39

Let me know if you want to do this today.

Thanks

Let's wait till next week, I don't want to do many master changes on a single day. Let's go for Tuesday next week? @RobH ?

[edit interfaces interface-range vlan-private1-a-codfw]
     member xe-2/0/0 { ... }
+    member ge-3/0/27;
[edit interfaces ge-3/0/27]
+    description db2040;
-    disable;
+    enable;

Port now live for the db2040 move. Once it has moved, please update this task so the old port (in row C) can be disabled/deactivated.

Mentioned in SAL (#wikimedia-operations) [2018-04-10T14:46:20Z] <marostegui> Stop MySQL on db2040 for server move - T191193

Change 425279 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: move db2040 from private1-c-codfw to private1-a-codfw

https://gerrit.wikimedia.org/r/425279

Change 425279 merged by Marostegui:
[operations/dns@master] DNS: move db2040 from private1-c-codfw to private1-a-codfw

https://gerrit.wikimedia.org/r/425279

Change 425285 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2040 IP

https://gerrit.wikimedia.org/r/425285

Change 425285 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2040 IP

https://gerrit.wikimedia.org/r/425285

Mentioned in SAL (#wikimedia-operations) [2018-04-10T15:21:42Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db2040 IP as it is being moved to another rack - T191193 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2018-04-10T15:22:52Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db2040 IP as it is being moved to another rack - T191193 (duration: 00m 59s)

Move db2040 from C6 to A3 in racktables
Please advise which server is next.

switch port information when ready to move db2045.

db2045 was on asw-c6-codfw ge-6/0/14 and now will be on asw-b3-codfw ge-3/0/20

new ip address will be :
10.192.16.74

Change 425298 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2045 IP

https://gerrit.wikimedia.org/r/425298

Mentioned in SAL (#wikimedia-operations) [2018-04-10T16:11:07Z] <marostegui> Stop MySQL on db2045 (s8 codfw master) to move it to another rack, this will break replication on codfw - T191193

Change 425303 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: move db2045 from private1-c-codfw to private1-b-codfw

https://gerrit.wikimedia.org/r/425303

Change 425303 merged by Marostegui:
[operations/dns@master] DNS: move db2045 from private1-c-codfw to private1-b-codfw

https://gerrit.wikimedia.org/r/425303

Change 425298 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2045 IP

https://gerrit.wikimedia.org/r/425298

Mentioned in SAL (#wikimedia-operations) [2018-04-10T16:25:00Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db2045 IP as it is being moved to another rack - T191193 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2018-04-10T16:26:08Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db2045 IP as it is being moved to another rack - T191193 (duration: 00m 59s)

moved db2045 from C6 to B3 in racktables

Please update task with next server we need to move next week.

thanks

Papaul triaged this task as Medium priority. Apr 12 2018, 2:01 PM

switch port information when ready to move db2042.

db2042 was on asw-c6-codfw ge-6/0/9 and now will be on asw-d3-codfw ge-3/0/10

new ip address will be :
10.192.48.115

@ayounsi can you configure asw-d3-codfw ge-3/0/10 for us?
We want to move db2042 to that port

Thanks!

asw-d-codfw-ge-3/0/10 now in private1-d-codfw.

Let me know when to disable asw-c6-codfw:ge-6/0/9

Change 427136 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Move db2042 fron private1-c-codfw to private1-d-codfw

https://gerrit.wikimedia.org/r/427136

Mentioned in SAL (#wikimedia-operations) [2018-04-17T14:52:59Z] <marostegui> Stop MySQL on db2042 to move it to another rack - https://phabricator.wikimedia.org/T191193

Change 427136 merged by Marostegui:
[operations/dns@master] DNS: Move db2042 fron private1-c-codfw to private1-d-codfw

https://gerrit.wikimedia.org/r/427136

switch port information when ready to move db2048.

db2048 was on asw-c6-codfw ge-6/0/17 and now will be on asw-a1-codfw ge-1/0/0

new ip address will be :
10.192.0.99
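Each new address posted in this task has to fall inside the VLAN named in the matching DNS change. A minimal sketch of that sanity check follows. The subnet prefixes in `ROW_SUBNETS` are an assumption (the task names the VLANs but never states their ranges), and `ip_matches_vlan` and `mismatches` are hypothetical helper names, not part of any existing tooling.

```python
import ipaddress

# ASSUMPTION: one /22 per row. The task only names the VLANs; these
# prefixes are illustrative and must be checked against the real network.
ROW_SUBNETS = {
    "private1-a-codfw": ipaddress.ip_network("10.192.0.0/22"),
    "private1-b-codfw": ipaddress.ip_network("10.192.16.0/22"),
    "private1-c-codfw": ipaddress.ip_network("10.192.32.0/22"),
    "private1-d-codfw": ipaddress.ip_network("10.192.48.0/22"),
}

# Destination VLAN (from the DNS changes) and new IP (from the comments)
# for each host whose new address is posted in this task.
NEW_IPS = {
    "db2039": ("private1-d-codfw", "10.192.48.114"),
    "db2040": ("private1-a-codfw", "10.192.0.39"),
    "db2045": ("private1-b-codfw", "10.192.16.74"),
    "db2042": ("private1-d-codfw", "10.192.48.115"),
    "db2048": ("private1-a-codfw", "10.192.0.99"),
}


def ip_matches_vlan(vlan: str, ip: str) -> bool:
    """Return True if `ip` lies inside the subnet assumed for `vlan`."""
    return ipaddress.ip_address(ip) in ROW_SUBNETS[vlan]


def mismatches() -> list[str]:
    """List hosts whose new IP is outside their destination VLAN."""
    return [
        host
        for host, (vlan, ip) in NEW_IPS.items()
        if not ip_matches_vlan(vlan, ip)
    ]
```

Under the assumed prefixes, a check like this could run before the DNS and mediawiki-config patches are merged, so an address from the wrong row is caught while the server is still up.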

Moved db2042 from C6 to D3 in racktables

Change 427150 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2048 IP

https://gerrit.wikimedia.org/r/427150

Mentioned in SAL (#wikimedia-operations) [2018-04-17T15:23:37Z] <marostegui> Stop MySQL on db2048 for rack movement - T191193

Change 427151 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Move db2048 from prvate1-c-odfw to private1-a-codfw

https://gerrit.wikimedia.org/r/427151

Change 427151 merged by Marostegui:
[operations/dns@master] DNS: Move db2048 from prvate1-c-odfw to private1-a-codfw

https://gerrit.wikimedia.org/r/427151

Change 427150 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change db2048 IP

https://gerrit.wikimedia.org/r/427150

asw-a1-codfw ge-1/0/0 enabled and in private1-a-codfw

Let me know when to cleanup the old port.

Mentioned in SAL (#wikimedia-operations) [2018-04-17T15:32:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change db2048 IP - T191193 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2018-04-17T15:33:46Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Change db2048 IP - T191193 (duration: 00m 58s)

Moved db2048 from C6 to A1 in racktables

@Marostegui assigning the task back to you; if you think everything looks good, you can close it.

Thanks

Thanks @Papaul!!
I have talked to @ayounsi and he will clean up the ports and close the task when ready

ayounsi closed this task as Resolved. Edited, Apr 17 2018, 4:53 PM

asw-a1-codfw ge-1/0/0 cleaned up
asw-c6-codfw ge-6/0/9 cleaned up

EDIT, wrong port:
asw-a1-codfw ge-1/0/0 rolled back
asw-c6-codfw ge-6/0/17 cleaned up

Were the right interfaces disabled after the revert?

Yeah:

asw-c6-codfw ge-6/0/17 cleaned up

That was the right one to clean up
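One way to make the old/new port distinction harder to confuse is to record each move as a pair and validate every cleanup request against it. The sketch below uses the port pairs posted in this task; the `Move` record and the `check_cleanup` helper are hypothetical names introduced for illustration, not existing tooling.

```python
from typing import NamedTuple


class Move(NamedTuple):
    old_port: str  # port in C6 that should be cleaned up after the move
    new_port: str  # port in the destination rack that must stay enabled


# Port pairs as posted in the comments of this task.
MOVES = {
    "db2035": Move("asw-c6-codfw ge-6/0/2", "asw-b1-codfw ge-1/0/4"),
    "db2039": Move("asw-c6-codfw ge-6/0/6", "asw-d1-codfw ge-1/0/14"),
    "db2040": Move("asw-c6-codfw ge-6/0/7", "asw-a3-codfw ge-3/0/27"),
    "db2045": Move("asw-c6-codfw ge-6/0/14", "asw-b3-codfw ge-3/0/20"),
    "db2042": Move("asw-c6-codfw ge-6/0/9", "asw-d3-codfw ge-3/0/10"),
    "db2048": Move("asw-c6-codfw ge-6/0/17", "asw-a1-codfw ge-1/0/0"),
}


def check_cleanup(host: str, port: str) -> str:
    """Return `port` if it is the old port for `host`, else raise ValueError."""
    move = MOVES[host]
    if port == move.new_port:
        raise ValueError(f"{port} is the NEW port for {host}; do not clean it up")
    if port != move.old_port:
        raise ValueError(f"{port} is not recorded as a port for {host}")
    return port
```

With this in place, a request to clean up `asw-a1-codfw ge-1/0/0` for db2048 would be rejected, while `asw-c6-codfw ge-6/0/17` would pass.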

Okay, I feel we should check what went wrong: was it the clarity of the communication, a one-time mistake that is unlikely to happen again, or the extended downtime on Icinga that kept the issue from being immediately apparent?

For example, as a procedure, could activity be checked on the port before being disabled to check the host is down/moved away?

I thought that was already done. But maybe it was missed this time.
I think it was a combination of all:

  • Confusing the new port with the old port
  • The extended downtime kept the issue from being obvious
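The activity check suggested above could look roughly like the following. This is only a sketch under stated assumptions: `PortSample` and `safe_to_disable` are hypothetical names, and how the link state and byte counters are read from the switch is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSample:
    """One observation of a switch port (hypothetical structure)."""
    link_up: bool
    rx_bytes: int
    tx_bytes: int


def safe_to_disable(first: PortSample, second: PortSample) -> bool:
    """Return True only if the port looked idle across both samples.

    The port counts as idle when the link was down in both samples and
    neither byte counter changed between them.
    """
    if first.link_up or second.link_up:
        return False
    before = (first.rx_bytes, first.tx_bytes)
    after = (second.rx_bytes, second.tx_bytes)
    return before == after
```

Taking two samples some minutes apart means a port that still carries traffic, such as the new port of a freshly moved master, would fail the check even while monitoring for the host is in downtime.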