relocate/reimage cloudvirt1008 with 10G interfaces
Closed, Resolved · Public

Description

This host only holds a canary instance; it can be moved to eqiad1-r any time.

This includes renaming the host from labvirt1008 to cloudvirt1008.

  • disable puppet in labvirt1008
  • drain labvirt1008
  • update RAID configuration to create spare disks
  • re-rack host in 10G rack
  • - connect primary 10G port and enable PXE (right now it's connected but PXE is not enabled)
  • - connect secondary 10G port and set up switch port
  • - assign back to @RobH for followup after the above is set
  • merge puppet patch to rename, get the new Debian installer working and disable notifications (rename hieradata/hosts/labvirt1009.yaml to cloudvirt1009.yaml and add "profile::base::notifications: disabled" temporarily)
  • merge dns patch to add the new FQDNs (partial, the old mgmt names still remain)
  • re-image
  • merge puppet patch to re-enable notifications (remove "profile::base::notifications")
  • netbox update https://netbox.wikimedia.org/dcim/devices/1452/
  • update docs https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments
  • physical relabeling and switch port description T220443
  • done

Event Timeline

Andrew created this task. Feb 20 2019, 10:04 PM
Restricted Application added a subscriber: Aklapper. Feb 20 2019, 10:04 PM
Andrew updated the task description. Feb 21 2019, 3:43 PM
Andrew updated the task description.
Andrew updated the task description. Feb 21 2019, 4:01 PM

Mentioned in SAL (#wikimedia-operations) [2019-02-22T14:03:51Z] <moritzm> removed labvirt1008 from debmonitor (T216661)

While attempting to silence some alerts (I think a result of an expiring downtime), I restarted this host and it had a hard time coming up, complaining about a failure to mount drives. I told it to skip the mount, and now that it's up it looks fine, but this needs more investigation.

The next step is moving this to a new rack for 10G connections (either 2, 4 or 7 in row B), so I'm tagging dc-ops. You can hand it back to me for the rename/rebuild once it's in place.

Restricted Application added a project: Operations. Feb 28 2019, 5:16 PM
jbond triaged this task as Medium priority. Mar 4 2019, 7:45 PM

Change 499480 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] installer: prepare to rebuild labvirt1008 as cloudvirt1009

https://gerrit.wikimedia.org/r/499480

Change 499480 merged by Andrew Bogott:
[operations/puppet@production] installer: prepare to rebuild labvirt1008 as cloudvirt1009

https://gerrit.wikimedia.org/r/499480

Mentioned in SAL (#wikimedia-operations) [2019-04-02T07:52:57Z] <moritzm> removed labvirt1008 from debmonitor (T216661)

Change 500800 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Rename labvirt1008 to cloudvirt1008

https://gerrit.wikimedia.org/r/500800

Change 500800 merged by Andrew Bogott:
[operations/dns@master] Rename labvirt1008 to cloudvirt1008

https://gerrit.wikimedia.org/r/500800

Andrew updated the task description. Apr 4 2019, 5:57 PM
RobH added a subscriber: RobH. Apr 4 2019, 5:58 PM

So when attempting to set up PXE on this, network device 2 (i.e. the 10G interface) isn't showing as a bootable device. The identical system (cloudvirt1009) shows the bootable device referenced, so it seems this system didn't get PXE enabled/saved in the BIOS.

RobH updated the task description. Apr 4 2019, 6:23 PM
RobH renamed this task from "cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008" to "relocate/reimage cloudvirt1009 with 10G interfaces". Apr 4 2019, 6:31 PM
RobH renamed this task from "relocate/reimage cloudvirt1009 with 10G interfaces" to "relocate/reimage cloudvirt1008 with 10G interfaces".
RobH assigned this task to Cmjohnson.
RobH updated the task description.
Andrew updated the task description. Apr 5 2019, 2:30 PM
Cmjohnson updated the task description. Apr 5 2019, 3:59 PM
Cmjohnson reassigned this task from Cmjohnson to RobH. Apr 5 2019, 4:02 PM
Cmjohnson added a subscriber: Cmjohnson.

Assigning back to RobH: the NIC has been enabled for PXE, the second cable has been run, and the port description has been updated and added to cloud-virt-instance-trunk.

RobH added a comment. Apr 5 2019, 6:52 PM

I'll describe the issue I'm seeing, and how I've troubleshot it, so far to no avail:

cloudvirt1008 will PXE boot, and does so successfully off its 10G interface. If I log in to the mgmt interface at https://cloudvirt1008.mgmt.eqiad.wmnet, I can see the MAC addresses for the network devices:

10G NIC port 1 (of 2) MAC : f0:92:1c:05:bc:d8
10G NIC port 2 (of 2) MAC : f0:92:1c:05:bc:dc

I boot the system using the one-time boot option 'network device 1' and it shows it attempting to network boot with MAC address f0:92:1c:05:bc:d8. We can even see it hit the DHCP server:

Apr  5 18:40:23 install1002 dhcpd: labvirt1009.eqiad.wmnet: host unknown.
Apr  5 18:40:23 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.2: network 10.64.20.0/24: no free leases
Apr  5 18:40:23 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.3: network 10.64.20.0/24: no free leases
Apr  5 18:40:27 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.2: network 10.64.20.0/24: no free leases
Apr  5 18:40:27 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.3: network 10.64.20.0/24: no free leases

The first line is the kicker: it points out a bad entry trying to reference labvirt1009.eqiad.wmnet. Looking in the lease file shows the mistake; someone seems to have introduced a copy/paste error. (I'm not bothering to run the git log to find blame because who cares.) Fix following shortly.
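
For anyone repeating this kind of check, here is a minimal sketch of how the two symptoms in the log excerpt could be picked out automatically: hostnames that dhcpd reports as "host unknown", and MACs whose DHCPDISCOVER went unanswered with "no free leases". The sample lines are copied from the excerpt above. The function name `summarize` and the regular expressions are illustrative assumptions, not part of any existing tooling, and they only cover the two message shapes quoted in this task.

```python
import re
from collections import defaultdict

# Sample lines copied from the install1002 excerpt quoted in this task.
LOG = """\
Apr  5 18:40:23 install1002 dhcpd: labvirt1009.eqiad.wmnet: host unknown.
Apr  5 18:40:23 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.2: network 10.64.20.0/24: no free leases
Apr  5 18:40:23 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.3: network 10.64.20.0/24: no free leases
Apr  5 18:40:27 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.2: network 10.64.20.0/24: no free leases
Apr  5 18:40:27 install1002 dhcpd: DHCPDISCOVER from f0:92:1c:05:bc:d8 via 10.64.20.3: network 10.64.20.0/24: no free leases
"""

# Hypothetical patterns matching only the two message shapes shown above.
HOST_UNKNOWN = re.compile(r"dhcpd: (?P<host>\S+): host unknown\.")
NO_LEASE = re.compile(
    r"dhcpd: DHCPDISCOVER from (?P<mac>[0-9a-f:]{17}) via (?P<relay>\S+): "
    r"network (?P<net>\S+): no free leases"
)


def summarize(log_text):
    """Return (sorted unknown hostnames, {mac: count of unanswered DISCOVERs})."""
    unknown_hosts = set()
    unanswered = defaultdict(int)
    for line in log_text.splitlines():
        match = HOST_UNKNOWN.search(line)
        if match:
            unknown_hosts.add(match.group("host"))
            continue
        match = NO_LEASE.search(line)
        if match:
            unanswered[match.group("mac")] += 1
    return sorted(unknown_hosts), dict(unanswered)


if __name__ == "__main__":
    hosts, macs = summarize(LOG)
    print("unknown hosts:", hosts)
    print("unanswered DISCOVERs per MAC:", macs)
```

An unknown hostname appearing alongside unanswered DISCOVERs from the MAC that is booting suggests a stale or mistyped host entry, which is the situation described above.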

Change 501666 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] cloudvirt1008 dhcp lease file correction

https://gerrit.wikimedia.org/r/501666

Change 501666 merged by RobH:
[operations/puppet@production] cloudvirt1008 dhcp lease file correction

https://gerrit.wikimedia.org/r/501666

Andrew updated the task description. Apr 8 2019, 7:47 PM
Andrew updated the task description. Apr 8 2019, 7:49 PM
RobH removed RobH as the assignee of this task. Apr 10 2019, 10:17 PM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Apr 16 2019, 6:30 PM
Cmjohnson closed this task as Resolved. Apr 16 2019, 6:39 PM
Cmjohnson claimed this task.
Cmjohnson updated the task description.

Completed