
Move cloudvirt hosts to 10Gb ethernet
Open, High, Public

Description

Most of our cloudvirt hosts have idle 10Gb nics. We haven't used them historically because their row was short on ports. Now that the switches have been upgraded, we can start moving these hosts over to 10Gb.

Each cloudvirt uses two nics. It would be nice to move both to 10Gb but if that's somehow difficult or expensive we can leave the control plane on 1Gb.

  • Determine which servers have unused 10Gb nics
    • all of them, it seems
  • Verify with dc-ops that there are abundant ports
    • only in racks 2, 4, and 7
  • Write a plan for switching over a given host (presumably as part of a rebuild)

Please note that each host will have a sub-task linked to this task for its actual relocation, recabling, and reimaging, because each host involves multiple steps. The description of this primary tracking task will only summarize overall status.

  • cloudvirt1001 - currently in rack b3 - empty and depooled -- T221141
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1002 - currently in rack b3 - empty and depooled -- T221140
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1003 - currently in rack b3 - empty and depooled -- T221139
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1004 - currently in rack b5 - empty and depooled -- T221138
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1005 - currently in rack b5 - empty and depooled -- T221049
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1006 - currently in rack b5 - empty and depooled -- T221048
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1007 - currently in rack 5 - empty and depooled -- T221047
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1008 -- currently empty and depooled -- T216661
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1009 -- currently empty and depooled -- T216324
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1012 -- currently empty and depooled -- T217346
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1013 - currently in rack b4
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1014 - currently in rack b5 -- T226188
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1015 -- currently empty and depooled -- T217140
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1016 - currently in rack b4 -- currently empty and depooled -- T228692
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1017 - currently in rack b7 -- currently empty and depooled -- T228691
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1018 -- T217347
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack - 1 of 2 interfaces connected
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1019
  • - system was deployed as 10G, nothing to do here.
  • cloudvirt1020
  • - system was deployed as 10G, nothing to do here.
  • cloudvirt1021 - currently in rack b4 -- T229873
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1022 - currently in rack b7 -- T229872
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1023 -- T229871
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1024 -- currently empty and depooled -- T216724
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team)
  • - system relocated to 10G interfaces/rack - 1 of 2 interfaces connected
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1025
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1026 - currently in rack b1
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1027 - currently in rack b3
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1028 - currently in rack b5
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1029 - currently in rack b6
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
  • cloudvirt1030 - currently in rack b8
  • - system drained of traffic on 1G, ready for relocation (this checkbox should only be checked by cloud-services-team & only when a sub-task has been populated for this server)
  • - system relocated to 10G interfaces/rack
  • - system reimaged
  • - system reintroduced into service cluster
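
Since every host is meant to have a linked sub-task, the host lines in the list above can be checked mechanically. The following is only a minimal Python sketch: it assumes host lines start with `cloudvirt` followed by digits and that a sub-task appears as `T` followed by digits on the same line. The name `hosts_without_subtask` is hypothetical and not part of any existing tooling.

```python
import re

# Assumed line shapes, taken from the checklist above.
HOST_RE = re.compile(r"^(cloudvirt\d+)")
TASK_RE = re.compile(r"\bT\d+\b")


def hosts_without_subtask(lines):
    """Return hosts whose line carries no T-number sub-task reference."""
    missing = []
    for line in lines:
        host = HOST_RE.match(line.strip())
        if host and not TASK_RE.search(line):
            missing.append(host.group(1))
    return missing


sample = [
    "cloudvirt1013 - currently in rack b4",
    "cloudvirt1014 - currently in rack b5 -- T226188",
    "cloudvirt1018 --T217347",
    "cloudvirt1025",
]
result = hosts_without_subtask(sample)
print(result)
```

Hosts that were deployed as 10G from the start (cloudvirt1019 and cloudvirt1020) would also be reported by this check and would need to be excluded by hand.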

Related Objects

Status | Assigned | Task
Open | None
Resolved | Cmjohnson
Resolved | Andrew
Resolved | Vgutierrez
Resolved | Andrew
Resolved | Cmjohnson
Resolved | Cmjohnson
Resolved | Cmjohnson
Resolved | Andrew
Resolved | aborrero
Resolved | Andrew
Resolved | Andrew
Resolved | Andrew
Resolved | Andrew
Resolved | Andrew
Resolved | Andrew
Resolved | Andrew
Resolved | Cmjohnson
Resolved | Cmjohnson
Resolved | Andrew
Resolved | Andrew
Resolved | Andrew

Event Timeline

Andrew updated the task description. Feb 19 2019, 11:21 PM
Andrew updated the task description. Feb 19 2019, 11:23 PM
Andrew updated the task description. Feb 20 2019, 3:21 PM
Andrew updated the task description. Feb 28 2019, 5:05 PM
Andrew updated the task description.
Andrew updated the task description.
Andrew updated the task description. Feb 28 2019, 5:09 PM
Andrew updated the task description.
Andrew updated the task description. Feb 28 2019, 5:12 PM
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM
bd808 triaged this task as High priority. Mar 26 2019, 8:11 PM
bd808 added a project: Epic.
Restricted Application added a project: Operations. Mar 27 2019, 4:53 PM
Andrew added a comment. Edited · Mar 27 2019, 4:55 PM

To clarify: We can live without this if it can't happen right away. We do, however, need a timeline for when some of it might (or might not) be doable, since preparing for a cloudvirt move and rebuild is a lot of work. Two of our primary goals (removing Trusty hosts and removing the old nova-network infrastructure) are currently blocked, waiting for an answer about scheduling.

Andrew added a comment. Apr 1 2019, 5:04 PM

I've just rechecked, and the following hosts are either empty or only running canary instances:

labvirt1008
cloudvirt1009
cloudvirt1012
cloudvirt1015
cloudvirt1018
cloudvirt1024

RobH mentioned this in Unknown Object (Task). Apr 1 2019, 5:18 PM
RobH added subscribers: Cmjohnson, RobH. Edited · Apr 1 2019, 5:42 PM

> I've just rechecked, and the following hosts are either empty or only running canary instances:
> labvirt1008
> cloudvirt1009
> cloudvirt1012
> cloudvirt1015
> cloudvirt1018
> cloudvirt1024

So @Cmjohnson: Any of the hosts listed above can be moved into 10G racks. For each host that moves, the following checklist will need to be applied:

  • - set server and its mgmt interface to maint window (no alerts) in icinga
  • - power down server, relocate into new 10G rack, list old switch ports on this task so @RobH can remove from switch stack
  • - update bios on server to re-enable 10G interfaces (leave 1G interfaces enabled as well, our installer should now be able to handle this)
  • - update this task with the server's two 10G interface ports, primary (for host) and secondary (for instance traffic), then @RobH can update the switch ports.
  • - after network ports are updated, server can be handed back to @Andrew for reimage.

Also please note we'll rename labvirt1008 as cloudvirt1008 in this process. @Cmjohnson please ensure this happens in netbox and on the physical label.

RobH added a comment. Apr 1 2019, 6:05 PM

Per Chris's request I've gone ahead and put the following servers into maint until Friday in icinga:

labvirt1008
cloudvirt1009
cloudvirt1012
cloudvirt1015
cloudvirt1018
cloudvirt1024

Network ports have been set up for the servers below and added to the cloud-hosts1 vlan. I need cables for cloudvirt1015 and 1024 and will finish once the new cables arrive.

xe-2/0/14 up up cloudvirt1008
xe-2/0/15 up up cloudvirt1009
xe-2/0/16 up up cloudvirt1012
xe-2/0/17 up up cloudvirt1018
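
The port listing above looks like four whitespace-separated columns: interface, admin state, link state, and description. That reading is an assumption, not something the task states. Under that assumption, a minimal Python sketch for turning the lines into a port-to-host map and spotting which of the six hosts still lack a port could look like this. The name `parse_port_lines` is hypothetical.

```python
def parse_port_lines(text):
    """Map switch port -> host for lines like 'xe-2/0/14 up up cloudvirt1008'.

    Assumes four whitespace-separated columns: interface, admin state,
    link state, description. Lines with any other shape are skipped.
    """
    ports = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        interface, admin, link, host = fields
        ports[interface] = {"host": host, "admin": admin, "link": link}
    return ports


listing = """\
xe-2/0/14 up up cloudvirt1008
xe-2/0/15 up up cloudvirt1009
xe-2/0/16 up up cloudvirt1012
xe-2/0/17 up up cloudvirt1018
"""

ports = parse_port_lines(listing)

# The six hosts put into maintenance; find those with no port listed yet.
wanted = ["cloudvirt1008", "cloudvirt1009", "cloudvirt1012",
          "cloudvirt1015", "cloudvirt1018", "cloudvirt1024"]
cabled = {entry["host"] for entry in ports.values()}
missing = [host for host in wanted if host not in cabled]
print(missing)  # ['cloudvirt1015', 'cloudvirt1024']
```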


Cmjohnson updated the task description. Apr 2 2019, 3:31 PM

Change 500799 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirts: update six servers to use 10Gb nics

https://gerrit.wikimedia.org/r/500799

Change 500799 merged by Andrew Bogott:
[operations/puppet@production] cloudvirts: update six servers to use 10Gb nics

https://gerrit.wikimedia.org/r/500799

RobH added a comment. Edited · Apr 2 2019, 8:11 PM

Please note I've been going through and updating the iLO and BIOS firmware for the following systems:

  • - cloudvirt1008
  • - cloudvirt1009
  • - cloudvirt1012 - giving me issues, will loop back to it when I complete the rest
  • - cloudvirt1015
  • - cloudvirt1018
  • - cloudvirt1024

RobH added a comment. Apr 2 2019, 10:35 PM

Ok, so attempting to load the following on cloudvirt1012 didn't work, even though it worked just fine for cloudvirt100[89]. All are the same DL360 Gen8 systems.

So perhaps the HP SPP image should be applied by Chris.

RobH added a comment. Apr 4 2019, 5:42 PM

Checklist for moving a cloudvirt from 1G to 10G:

  • - put system offline in all checks for maint window
  • - relocate to 10G rack and update netbox
  • - enable PXE for 10G interfaces.
  • - update switch configuration for new primary 10G and secondary 10G ports (remove old switch port information)
  • - PXE boot and reimage system
  • - reintroduce system into service cluster
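
The checklist above reads as a strict sequence, although the task does not say so outright. Assuming each step must be finished before the next one starts, a minimal Python sketch that reports the next outstanding step for a host might look like this. `MOVE_STEPS` and `next_step` are hypothetical names, not part of any existing tooling.

```python
# Steps copied from the checklist above, in the order given there.
MOVE_STEPS = [
    "put system offline in all checks for maint window",
    "relocate to 10G rack and update netbox",
    "enable PXE for 10G interfaces",
    "update switch configuration for new primary 10G and secondary 10G ports",
    "PXE boot and reimage system",
    "reintroduce system into service cluster",
]


def next_step(done):
    """Return the first step not yet done, or None when all are complete.

    Raises ValueError for unknown steps, or if a later step is marked
    done while an earlier one is still outstanding.
    """
    done = set(done)
    unknown = done - set(MOVE_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    pending = None
    for step in MOVE_STEPS:
        if step in done:
            if pending is not None:
                raise ValueError(f"{step!r} done before {pending!r}")
        elif pending is None:
            pending = step
    return pending


print(next_step(MOVE_STEPS[:2]))  # enable PXE for 10G interfaces
```
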

RobH updated the task description. Apr 4 2019, 6:06 PM
RobH updated the task description. Apr 4 2019, 6:19 PM
RobH updated the task description.
RobH updated the task description. Apr 4 2019, 8:11 PM

Change 501712 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirts: update nic names for 10Gb

https://gerrit.wikimedia.org/r/501712

Change 501712 merged by Andrew Bogott:
[operations/puppet@production] cloudvirts: update nic names for 10Gb

https://gerrit.wikimedia.org/r/501712

Andrew updated the task description. Apr 15 2019, 9:53 PM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Apr 16 2019, 6:25 PM
Andrew updated the task description. Apr 16 2019, 8:00 PM
Andrew updated the task description. Apr 16 2019, 8:14 PM
Andrew updated the task description. Apr 18 2019, 6:28 PM
Andrew updated the task description. Jun 20 2019, 2:21 PM
Andrew updated the task description. Jun 20 2019, 3:10 PM

Change 524835 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: update hypervisor scheduling pool

https://gerrit.wikimedia.org/r/524835

Change 524835 merged by Andrew Bogott:
[operations/puppet@production] nova: update hypervisor scheduling pool

https://gerrit.wikimedia.org/r/524835

Andrew updated the task description. Jul 22 2019, 7:56 PM
Andrew updated the task description. Jul 23 2019, 3:55 PM
Cmjohnson updated the task description. Jul 25 2019, 6:40 PM
Andrew updated the task description. Aug 5 2019, 7:19 PM

Change 528231 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: update scheduler pool

https://gerrit.wikimedia.org/r/528231

Change 528231 merged by Andrew Bogott:
[operations/puppet@production] nova: update scheduler pool

https://gerrit.wikimedia.org/r/528231