
Replace labtestnet2001 with labtestnet2003 and decommission labtestnet2001
Closed, Resolved, Public

Description

In T196000: move/setup/install labtestnet2003(WMF6469), labtestnet2003 was set up as the direct replacement for labtestnet2001, which is aging ({T193081}).

This task is for moving functionality from labtestnet2001 to labtestnet2003.

Event Timeline

chasemp triaged this task as Medium priority. Jun 8 2018, 3:40 PM
chasemp created this task.
Vvjjkkii renamed this task from Replace labtestnet2001 with labtestnet2003 and decommission labtestnet2001 to zdbaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii removed aborrero as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: aborrero; removed: Aklapper.
CommunityTechBot renamed this task from zdbaaaaaaa to Replace labtestnet2001 with labtestnet2003 and decommission labtestnet2001. Jul 2 2018, 1:14 PM
CommunityTechBot assigned this task to aborrero.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot edited subscribers, added: Aklapper; removed: aborrero.

Change 446059 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud vps: disable labtestnet2001 and replace it with labtestnet2003

https://gerrit.wikimedia.org/r/446059

Change 446059 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud vps: disable labtestnet2001 and replace it with labtestnet2003

https://gerrit.wikimedia.org/r/446059

labtestnet2001 is now empty. Notes: https://etherpad.wikimedia.org/p/labtestnet2001

The active host is now labtestnet2002, while labtestnet2003 is standby.

Change 446069 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud vps: labtest: missing allowed connection

https://gerrit.wikimedia.org/r/446069

Change 446069 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud vps: labtest: missing allowed connection

https://gerrit.wikimedia.org/r/446069

Change 446255 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] Revert "cloud vps: labtest: missing allowed connection"

https://gerrit.wikimedia.org/r/446255

Change 446255 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Revert "cloud vps: labtest: missing allowed connection"

https://gerrit.wikimedia.org/r/446255

Change 446274 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud vps: labtestn: allow more connections from labtest

https://gerrit.wikimedia.org/r/446274

Change 446274 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud vps: labtestn: allow more connections from labtest

https://gerrit.wikimedia.org/r/446274

Ok, this is the status of the labtest cluster as far as I know (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments#Labtest_deployment).

The labtestnet2001 server is currently in spare role, intended for decom soon.
The labtestnet2002 server is currently assigned the net active role.
The labtestnet2003 server, which is new, is assigned the net standby role.

The net nodes have 2 NICs connected to the switches:

  • eth0, the main interface
  • eth1, which is in vlan 2102, the vlan of the VMs

It turns out that only labtestnet2001 actually has this configuration deployed (switch ports, wires, etc.):

  • labtestnet2002 only has eth0 connected, configured and working
  • labtestnet2003: same

It seems nobody will be in the codfw datacenter until 2nd Aug to plug in the NICs/switch ports, so we might consider a rollback.
Or we could try native vlan + trunk on eth0, which is a hack.

aborrero@labtestnet2002:~ $ sudo ethtool eth1 | grep Link
	Link detected: no
aborrero@labtestnet2002:~ $ sudo ip a | grep -e eth1 -e br2102
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
6: br2102: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    inet 10.196.16.1/24 brd 10.196.16.255 scope global br2102
7: eth1.2102@eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master br2102 state LOWERLAYERDOWN group default
aborrero@labtestnet2003:~ $ sudo ethtool eth1 | grep Link
	Link detected: no
aborrero@labtestnet2003:~ $ sudo ip a | grep -e eth1 -e br2102
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
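The output above can also be checked mechanically. Below is a minimal Python sketch, not part of any existing tooling, that parses the interface header lines of `ip a` output and classifies each link. The helper names (`parse_links`, `describe`) and the wording of the labels are assumptions made for illustration; the sample text is copied from the transcripts in this comment.

```python
import re

# Header line of each interface in `ip a` output, e.g.
# "3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 ... state DOWN ..."
HEADER = re.compile(r"^(\d+): ([^:]+): <([^>]*)> .*?\bstate (\S+)")

# Sample output copied from this task (labtestnet2002 and labtestnet2003).
SAMPLE_2002 = """\
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
6: br2102: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    inet 10.196.16.1/24 brd 10.196.16.255 scope global br2102
7: eth1.2102@eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master br2102 state LOWERLAYERDOWN group default
"""
SAMPLE_2003 = """\
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
"""


def parse_links(ip_output):
    """Return {interface name: (set of flags, operstate)} for each header line."""
    links = {}
    for line in ip_output.splitlines():
        match = HEADER.match(line)
        if match:
            name = match.group(2).split("@")[0]  # "eth1.2102@eth1" -> "eth1.2102"
            flags = set(match.group(3).split(","))
            links[name] = (flags, match.group(4))
    return links


def describe(flags, state):
    """Hypothetical labels: admin down, no carrier, or the raw operstate."""
    if "UP" not in flags:
        return "admin down"
    if "NO-CARRIER" in flags:
        return "no carrier"
    return "state " + state


if __name__ == "__main__":
    for host, sample in (("labtestnet2002", SAMPLE_2002), ("labtestnet2003", SAMPLE_2003)):
        for name, (flags, state) in parse_links(sample).items():
            print(host, name, describe(flags, state))
```

On these samples, eth1 on labtestnet2002 is configured but has no carrier, while eth1 on labtestnet2003 is not even brought up, which matches the "Link detected: no" results from ethtool.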

Not having this cluster properly running could impact our development of eqiad1 (since @Andrew is working on migration scripts).
In any case, I would like to coordinate before doing further steps.

I'm temporarily reverting this change, pending us having someone in codfw do the needed cable work.

Change 446562 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: labtest: only have one active net node at a time

https://gerrit.wikimedia.org/r/446562

Change 446562 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: labtest: only have one active net node at a time

https://gerrit.wikimedia.org/r/446562

After the rollback, I did some tests and the labtest setup seems to work again.

All connections should be in place and ready for further operations.

Change 450959 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud vps: disable labtestnet2001 and replace it with labtestnet2003

https://gerrit.wikimedia.org/r/450959

Change 450959 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud vps: disable labtestnet2001 and replace it with labtestnet2003

https://gerrit.wikimedia.org/r/450959

I did this operation again today, notes: https://etherpad.wikimedia.org/p/labtestnet2001-again

@ayounsi could you please change the routing again, like in T199779?

To be clear, you mean this specific change, T199779#4430882?
Change the static route 10.196.16.0/21 from labtestnet2001 to labtestnet2002?

@ayounsi -- yes, I think that's what he means, except we're switching to 2003, not 2002.

Sync'ed up over IRC, change pushed to cr1/2-codfw:

[edit routing-options static route 10.196.16.0/21]
-    next-hop 10.192.20.5;
+    next-hop 10.192.20.9;
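As a sanity check on the addressing, here is a small Python sketch using the standard `ipaddress` module. It only verifies the prefix arithmetic: that the br2102 subnet seen earlier on labtestnet2002 (10.196.16.1/24) falls inside the static route 10.196.16.0/21. The constant and function names are made up for illustration, and which host owns each next-hop address is not stated in this task, so the sketch does not assume it.

```python
import ipaddress

# Values copied from this task.
STATIC_ROUTE = ipaddress.ip_network("10.196.16.0/21")
BRIDGE_ADDR = ipaddress.ip_interface("10.196.16.1/24")  # br2102 address seen on labtestnet2002
OLD_NEXT_HOP = ipaddress.ip_address("10.192.20.5")
NEW_NEXT_HOP = ipaddress.ip_address("10.192.20.9")


def covered_by_route(iface, route):
    """True if the interface's whole subnet sits inside the routed prefix."""
    return iface.network.subnet_of(route)


if __name__ == "__main__":
    print("routed range:", STATIC_ROUTE[0], "-", STATIC_ROUTE[-1])
    print("bridge subnet covered:", covered_by_route(BRIDGE_ADDR, STATIC_ROUTE))
    print("next-hop change:", OLD_NEXT_HOP, "->", NEW_NEXT_HOP)
```

The /21 spans 10.196.16.0 through 10.196.23.255, so traffic for the VM bridge subnet follows whichever next hop the routers have configured for that prefix.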

I've confirmed that

  • Puppet doesn't restart nova-network or nova-api on labtestnet2001
  • Puppet /does/ restart nova-network on labtestnet2002
  • With nova services stopped on labtestnet2001 I'm still able to bring up a VM in the codfw region and ssh in

So as far as I'm concerned this switchover is done. We should now open a ticket to decom labtestnet2001.

Thanks! Will handle the decom ticket.

RobH mentioned this in Unknown Object (Task). Nov 29 2018, 8:27 PM