Update and move labnet1001/1002
Closed, ResolvedPublic

Description

Labnet1001 is due for some kernel updates which we've been postponing since a reboot will interrupt WMCS network traffic. Now that labnet1001 also needs to be re-racked, we may as well get everything done at once.

  • Update/reboot labnet1002
  • Move active traffic to labnet1002 (will cause service outage) 2018-05-15 13:00 UTC
  • Update/reboot labnet1001
  • labnet1001 moves from B3 to B2 2018-05-15 14:00 UTC
  • Move active traffic back to labnet1001 (will cause another service outage) 2018-05-15 16:00 UTC
  • Move labnet1002 to new switch

Event Timeline

Andrew triaged this task as Medium priority. May 1 2018, 10:12 PM
Andrew created this task.

Mentioned in SAL (#wikimedia-operations) [2018-05-01T22:19:16Z] <andrewbogott> rebooting labnet1002 for T193579

root@labnet1002:~# uname -a
Linux labnet1002 3.13.0-145-generic #194-Ubuntu SMP Thu Apr 5 15:20:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@Andrew @chasemp One other thing we should do here is move labnet1002 to the new switch.

Can we do this on May 15? 1500UTC/1000 EST

@Cmjohnson, The 15th sounds good for labnet1001. I don't think 1500UTC is the same thing as 1000 EST but I'm going to assume that the EST part is what interests you :)
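
For reference, here is a minimal sketch of the conversion behind that remark. It uses fixed UTC offsets rather than a timezone database, and the names `proposed` and `local_time` are illustrative only. 15:00 UTC is 10:00 at UTC-5 (EST), but the US East Coast observes daylight time (EDT, UTC-4) in mid-May, so the local wall-clock time would be 11:00.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones (assumption: no DST database lookup, offsets hard-coded).
EST = timezone(timedelta(hours=-5), "EST")  # Eastern Standard Time
EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern Daylight Time, in effect in May

# The proposed slot from the comment above: May 15 2018, 15:00 UTC.
proposed = datetime(2018, 5, 15, 15, 0, tzinfo=timezone.utc)


def local_time(dt, tz):
    """Render an aware datetime as HH:MM plus the zone name in the given zone."""
    return dt.astimezone(tz).strftime("%H:%M %Z")


if __name__ == "__main__":
    print(local_time(proposed, EST))  # 10:00 EST
    print(local_time(proposed, EDT))  # 11:00 EDT
```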

Labnet1002 isn't handling any traffic right now, so if you want to move that to the new switch sometime soon that'd be just fine, just let me know when it's done.

Ah, we were confused. From https://phabricator.wikimedia.org/T193196#4164325 I thought labnet1002 was staying put entirely. Let's move labnet1002 before we fail over to it and do the other maintenance. It can be moved any time while it is on standby. Today or tomorrow?

Andrew updated the task description.

@Cmjohnson did you get a chance to move labnet1002?

@chasemp @andrewbogott No, it has not been moved yet.

OK. In theory we can move it after the outage window tomorrow, since we're planning to switch all traffic back to labnet1001 after it gets re-racked. The only risk I can think of is if labnet1001 doesn't survive the move and we have to rely on 1002 long-term.

Change 430118 had a related patch set uploaded (by Andrew Bogott; owner: cpettet):
[operations/puppet@production] openstack: move nova-api and nova-network functions to labnet1002

https://gerrit.wikimedia.org/r/430118

Mentioned in SAL (#wikimedia-operations) [2018-05-15T12:59:50Z] <andrewbogott> stopping puppet on labnet1001 and 1002, silencing icinga for T193579

Change 430118 merged by Andrew Bogott:
[operations/puppet@production] openstack: move nova-api and nova-network functions to labnet1002

https://gerrit.wikimedia.org/r/430118

Mentioned in SAL (#wikimedia-operations) [2018-05-15T13:07:42Z] <andrewbogott> stopping nodepool and puppet on labnodepool1001 for T193579

Change 433153 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openstack: move nova-api and nova-network functions to labnet1001

https://gerrit.wikimedia.org/r/433153

Change 433153 merged by Andrew Bogott:
[operations/puppet@production] openstack: move nova-api and nova-network functions to labnet1001

https://gerrit.wikimedia.org/r/433153

NOTICE: current situation is labnet1002 as a SPOF

We ran through our normal procedure to fail traffic from labnet1002 back to labnet1001 (after this morning's move). Labnet1001 saw incoming traffic from external parties hit eth0 but could not route it to any instances. The bridge interface and addressing came up for br1102 and the gateway IP transferred, but there was still no connectivity. I looked at eth1 there and at the corresponding switch interface and did not see anything that suggested a quick fix, so we decided to move traffic back to labnet1002.

To debug, we pulled an unused IP from the instance range and assigned it to the VLAN subinterface eth1.1102 on labnet1001, which should then be able to reach instances. It currently cannot, which I believe demonstrates the same failure we saw during failover.


labnet1001:~# ip addr add 10.68.23.183/21 dev eth1.1102

labnet1001:~# ip link show eth1.1102

4: eth1.1102@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:50:42:2a brd ff:ff:ff:ff:ff:ff

labnet1001:~# ip route show

default via 10.64.20.1 dev eth0
10.64.20.0/24 dev eth0  proto kernel  scope link  src 10.64.20.13
10.68.16.0/21 dev eth1.1102  proto kernel  scope link  src 10.68.23.183

login.tools.wmflabs.org has address 208.80.155.163
tools-bastion-03.tools.eqiad.wmflabs has address 10.68.23.58

(this should work)

root@labnet1001:~# ping -c 3 10.68.23.58
PING 10.68.23.58 (10.68.23.58) 56(84) bytes of data.
From 10.68.23.183 icmp_seq=1 Destination Host Unreachable
From 10.68.23.183 icmp_seq=2 Destination Host Unreachable
From 10.68.23.183 icmp_seq=3 Destination Host Unreachable
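
As a sanity check on the "(this should work)" expectation, here is a small sketch confirming that the test address and the target instance sit in the same /21, so the ping is on-link and needs no gateway. The helper name `is_on_link` is illustrative. With an on-link target, "Destination Host Unreachable" reported from the host's own address usually means address resolution on that segment failed, which would point at layer 2 (VLAN or switch) rather than routing, though that reading is an inference and not something the output proves.

```python
import ipaddress

# Address assigned to eth1.1102 above, and the instance being pinged.
iface = ipaddress.ip_interface("10.68.23.183/21")
target = ipaddress.ip_address("10.68.23.58")


def is_on_link(interface, address):
    """True if the address falls inside the interface's connected subnet."""
    return address in interface.network


if __name__ == "__main__":
    print(iface.network)              # 10.68.16.0/21, matching the route table
    print(is_on_link(iface, target))  # True: reachable directly, no gateway hop
```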

cloud-instances1-b-eqiad was not trunked between asw2 and asw, it is now.

Thanks man.

Looks good:

root@labnet1001:~# ping -c 3 10.68.23.58
PING 10.68.23.58 (10.68.23.58) 56(84) bytes of data.
64 bytes from 10.68.23.58: icmp_seq=1 ttl=64 time=0.886 ms
64 bytes from 10.68.23.58: icmp_seq=2 ttl=64 time=0.311 ms
64 bytes from 10.68.23.58: icmp_seq=3 ttl=64 time=0.281 ms

Change 433383 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openstack: move nova-api and nova-network functions to labnet1001

https://gerrit.wikimedia.org/r/433383

Change 433383 merged by Andrew Bogott:
[operations/puppet@production] openstack: move nova-api and nova-network functions to labnet1001

https://gerrit.wikimedia.org/r/433383

@Cmjohnson, all set for you to move labnet1002 now.

@Andrew This was completed. If all is well, please resolve.

Vvjjkkii renamed this task from Update and move labnet1001/1002 to 4tdaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: Cmjohnson; removed: gerritbot, Aklapper.
Yann renamed this task from 4tdaaaaaaa to Update and move labnet1001/1002. Jul 1 2018, 1:34 PM
Yann closed this task as Resolved.
Yann assigned this task to Cmjohnson.
Yann lowered the priority of this task from High to Medium.
Yann updated the task description.
Yann edited subscribers, added: gerritbot, Aklapper; removed: Cmjohnson.