
relocate/reimage cloudvirt1007 with 10G interfaces
Closed, Resolved · Public

Description

  • put system offline in all checks for maint window
  • apply BIOS/iLO updates
  • update RAID to include spare drives
  • rename/rebuild system from labvirt1007 to cloudvirt1007, move to role::spare
  • relocate to 10G rack and update netbox
  • enable PXE for 10G interfaces
  • update switch configuration for new primary 10G NIC
  • update switch configuration and attach secondary 10G port
  • remove old switch port info
  • PXE boot and reimage system; change netbox status from planned to staged
  • update netbox with new name and location
  • reintroduce system into service cluster; change netbox status from staged to active
  • update switch and physical labels with new name and location
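Once the host is racked and PXE-ready, the reimage itself is driven from a cumin host, as the log entries below show. A minimal sketch of the invocation (the `-p` task flag and the `sudo -i` wrapper are assumptions; the sketch prints the command rather than executing it):

```shell
# Hypothetical wmf-auto-reimage invocation, run on the cumin host
# (e.g. cumin1001.eqiad.wmnet). The -p flag (link the run to the
# Phabricator task) is an assumption, not confirmed from this log.
HOST=cloudvirt1007.eqiad.wmnet
CMD="sudo -i wmf-auto-reimage -p T221047 $HOST"
echo "$CMD"  # printed rather than executed in this sketch
```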

Event Timeline

Andrew created this task. · Apr 15 2019, 9:50 PM
Restricted Application removed a project: Patch-For-Review. · Apr 15 2019, 9:51 PM
Andrew updated the task description. · Apr 15 2019, 10:00 PM

Change 504215 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Rename some labvirts to cloudvirts

https://gerrit.wikimedia.org/r/504215

Change 504217 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Rename some labvirts to cloudvirts

https://gerrit.wikimedia.org/r/504217

Change 504215 merged by Andrew Bogott:
[operations/puppet@production] Rename some labvirts to cloudvirts

https://gerrit.wikimedia.org/r/504215

Change 504217 merged by Andrew Bogott:
[operations/dns@master] Rename some labvirts to cloudvirts

https://gerrit.wikimedia.org/r/504217

colewhite triaged this task as Normal priority. · Apr 16 2019, 3:40 PM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. · Apr 16 2019, 6:14 PM

MAC: F0:92:1C:05:4A:98

Cmjohnson updated the task description. · Apr 17 2019, 8:29 PM

This host has a broken 300 GB SAS drive that needs to be replaced before we can reimage.

RobH updated the task description. · Apr 18 2019, 8:39 PM
Cmjohnson updated the task description. · Apr 19 2019, 5:41 PM

I created a procurement task: T221470.

Dzahn changed the task status from Open to Stalled. · Apr 29 2019, 7:10 PM

@Andrew, the disk has been replaced; it's all yours to install.

Andrew is traveling this week, so I will handle the reimage.

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905021545_aborrero_241266_cloudvirt1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905021545_aborrero_241339_cloudvirt1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905021546_aborrero_241395_cloudvirt1007_eqiad_wmnet.log.

aborrero changed the task status from Stalled to Open. · May 2 2019, 3:49 PM
aborrero claimed this task.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905021631_aborrero_250834_cloudvirt1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905021658_aborrero_255927_cloudvirt1007_eqiad_wmnet.log.

I had some issues with reimaging because of the drive replacement. LVM couldn't find the old disk UUID (obviously), and I had to force things by hand until the wmf-auto-reimage script could run smoothly enough.
We still get a prompt in the Debian installer asking for confirmation of the partman recipe.

The last wmf-auto-reimage script run seems to be working just fine.
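The manual "forcing by hand" described above can be sketched roughly as wiping the stale LVM signatures so the installer stops tripping over the old disk's metadata. A dry-run sketch, under assumptions (device name `/dev/sda` and the exact flags are illustrative, not taken from this task):

```shell
# Hypothetical cleanup sketch (dry-run only). After a disk swap, stale LVM
# signatures on the array can confuse the installer; wiping them lets the
# partman recipe run cleanly. The device name /dev/sda is an assumption.
DISK=/dev/sda
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"   # print the command instead of executing it
  else
    "$@"
  fi
}
run wipefs --all "$DISK"     # clear filesystem/LVM signatures from the device
run pvremove -ff -y "$DISK"  # drop any stale LVM physical-volume label
```

With `DRY_RUN=1` the script only prints its plan, which is a reasonable way to review destructive steps before running them on a production hypervisor.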

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

=> ctrl slot=0 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 146 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 146 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 300 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 300 GB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 300 GB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 300 GB): OK
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 300 GB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 300 GB): OK
   physicaldrive 1I:1:9 (port 1I:box 1:bay 9, 300 GB): OK
   physicaldrive 1I:1:11 (port 1I:box 1:bay 11, 300 GB): OK
   physicaldrive 1I:1:12 (port 1I:box 1:bay 12, 300 GB): OK
   physicaldrive 1I:1:13 (port 1I:box 1:bay 13, 300 GB): OK
   physicaldrive 2I:1:14 (port 2I:box 1:bay 14, 300 GB): OK
   physicaldrive 2I:1:15 (port 2I:box 1:bay 15, 300 GB): OK
   physicaldrive 2I:1:16 (port 2I:box 1:bay 16, 300 GB): OK
   physicaldrive 2I:1:17 (port 2I:box 1:bay 17, 300 GB): OK
   physicaldrive 1I:1:10 (port 1I:box 1:bay 10, 300 GB, spare): OK
   physicaldrive 2I:1:18 (port 2I:box 1:bay 18, 300 GB, spare): OK


=> ctrl slot=0 ld all show status

   logicaldrive 1 (136.7 GB, 1): OK
   logicaldrive 2 (1.9 TB, 1+0): OK
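The controller output above can be sanity-checked mechanically, which is handy when verifying the RAID spare-drive step from the checklist. A sketch that parses saved `pd all show status` output (the file path and the sample lines are assumptions; the line format follows the paste above):

```shell
# Sketch: count non-OK physical drives and configured spares from a saved
# copy of `ctrl slot=0 pd all show status`. Sample data is abbreviated.
cat > /tmp/pd_status.txt <<'EOF'
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 146 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 300 GB): OK
   physicaldrive 1I:1:10 (port 1I:box 1:bay 10, 300 GB, spare): OK
   physicaldrive 2I:1:18 (port 2I:box 1:bay 18, 300 GB, spare): OK
EOF
# A healthy line ends "): OK"; anything else after "): " counts as not-OK.
not_ok=$(grep -c 'physicaldrive.*): *[^O]' /tmp/pd_status.txt || true)
spares=$(grep -c ', spare)' /tmp/pd_status.txt || true)
echo "not_ok=$not_ok spares=$spares"
```

For the full listing in this task, the expected result is zero not-OK drives and two spares, matching the checklist item "update RAID to include spare drives".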

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905031007_aborrero_206352_cloudvirt1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905031033_aborrero_211644_cloudvirt1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905031034_aborrero_211701_cloudvirt1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudvirt1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905031040_aborrero_213590_cloudvirt1007_eqiad_wmnet.log.

The trick was to manually set PXE boot for the Debian installer and then, while the installer is working, manually switch back to disk boot.

</>hpiLO-> set /system1/bootconfig1/bootsource5 bootorder=1

status=0
status_tag=COMMAND COMPLETED
Fri May  3 10:39:48 2019

Bootorder being set.

bootsource5=BootFmNetwork1   bootorder=1
bootsource2=BootFmFloppy    bootorder=2
bootsource4=BootFmUSBKey    bootorder=3
bootsource1=BootFmCd        bootorder=4
bootsource3=BootFmDisk      bootorder=5
bootsource6=BootFmNetwork2   bootorder=6

</>hpiLO-> set /system1/bootconfig1/bootsource3 bootorder=1

status=0
status_tag=COMMAND COMPLETED
Fri May  3 10:46:49 2019

Bootorder being set.

bootsource3=BootFmDisk      bootorder=1
bootsource2=BootFmFloppy    bootorder=2
bootsource4=BootFmUSBKey    bootorder=3
bootsource1=BootFmCd        bootorder=4
bootsource5=BootFmNetwork1   bootorder=5
bootsource6=BootFmNetwork2   bootorder=6

Completed auto-reimage of hosts:

['cloudvirt1007.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-cloud) [2019-05-03T11:53:18Z] <arturo> T221047 create VM instance canary1007-01 and place it in cloudvirt1007

Change 507949 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: eqiad1: repool cloudvirt1007

https://gerrit.wikimedia.org/r/507949

aborrero closed this task as Resolved. · May 3 2019, 12:07 PM
aborrero updated the task description.

Change 507949 merged by Andrew Bogott:
[operations/puppet@production] openstack: eqiad1: repool cloudvirt1007

https://gerrit.wikimedia.org/r/507949