
codfw spare pool system for partman testing
Closed, Resolved, Public

Description

This task will track the temp allocation of a spare pool system to @CDanis for work in testing new partman configs (and fixing T215183).

Allocating wmf6653 in codfw d8/U5

https://netbox.wikimedia.org/dcim/devices/824/

Things that need to happen for this to be used for testing:

  • add the hostname theemin to mgmt dns entries
  • add production entries for the hostname (needs a production dns entry to pxe boot) (use private1-d-codfw subnet)
  • update network port label and vlan
  • update netboot to point at testing recipes
  • test as needed
  • when done testing, decom the system to reclaim to spares (leaving in netbox, placing back to inventory status with only idrac dns setup)

Event Timeline

RobH triaged this task as Medium priority.
RobH updated the task description.

network port asw-d-codfw:ge-8/0/4 labeled as 'theemin' and set to private1-d-codfw vlan

Change 488094 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] theemin.codfw and theemin.mgmt.codfw

https://gerrit.wikimedia.org/r/488094

Change 488094 merged by CDanis:
[operations/dns@master] theemin.codfw and theemin.mgmt.codfw

https://gerrit.wikimedia.org/r/488094

Change 488097 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] theemin testing: add dhcp and netboot configs

https://gerrit.wikimedia.org/r/488097

Change 488097 merged by CDanis:
[operations/puppet@production] theemin testing: add dhcp and netboot configs

https://gerrit.wikimedia.org/r/488097

Script wmf-auto-reimage was launched by cdanis on cumin2001.codfw.wmnet for hosts:

['theemin.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201902051753_cdanis_22559.log.

RobH added a subscriber: Papaul.
This comment was removed by RobH.
  1. Installed server with standard raid1-lvm-ext4-srv.cfg partman config
    • Booted fine
  2. Went into BIOS and swapped boot order of SATA devices (afterwards, port B first)
    • Server seemed to hang at "Booting from Hard drive C:"
  3. back into BIOS; swapped boot order back to original (port A first).
    • Booted fine
  4. used install-console to grub-install /dev/sdb
  5. rebooted using default boot order
    • Booted fine
  6. back into BIOS; swap boot order (port B first)
    • Booted fine!
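
Before swapping the BIOS boot order, it can help to check whether each disk's first sector carries GRUB boot code at all. The following is a minimal sketch, not something that was run on this task: the function name `has_grub_boot_code` is hypothetical, and it only looks for the `GRUB` marker string that GRUB's boot sector contains (visible at offset 0x188 in the hexdump further down this task). Finding the marker does not prove the disk is bootable.

```shell
# Hypothetical helper: succeed if the first 512-byte sector of the given
# device or image file contains the "GRUB" marker string.
has_grub_boot_code() {
    # Read exactly one sector; -a makes grep treat binary data as text.
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -aq 'GRUB'
}

# Example use (as root, on the installed system):
#   for disk in /dev/sda /dev/sdb; do
#       if has_grub_boot_code "$disk"; then
#           echo "$disk: GRUB marker found"
#       else
#           echo "$disk: no GRUB marker"
#       fi
#   done
```

In the sequence above, such a check would presumably have failed for /dev/sdb before step 4 and passed after it.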

Change 488110 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] add 'dualboot' fork of raid1-lvm-ext4-srv & try it on theemin

https://gerrit.wikimedia.org/r/488110

Change 488110 merged by CDanis:
[operations/puppet@production] add 'dualboot' fork of raid1-lvm-ext4-srv & try it on theemin

https://gerrit.wikimedia.org/r/488110

  1. Overwrote first 512 bytes of /dev/sda and /dev/sdb with zeros
    • Boot automatically fell through disk to using PXE
  2. Reimaged system using raid1-lvm-ext4-srv-dualboot.cfg
    • Got a warning about the LVM volume group name being already in use; clicked through it manually
  3. Installer rebooted system, leaving BIOS disk boot order at port B first
    • Hung at "Booting from Hard drive C:"
  4. Reset BIOS boot order to port A first
    • Booted fine
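
The zeroing in step 1 can be sketched as below. This is an assumed reconstruction, since the task does not record the exact command used; `wipe_first_sector` is a hypothetical name. It is destructive: on an MBR-partitioned disk the first sector holds both the boot code and the partition table.

```shell
# Hypothetical helper: overwrite only the first 512-byte sector of the
# given block device (or image file) with zeros.
# WARNING: destructive. Double-check the target before running.
wipe_first_sector() {
    target="$1"
    # conv=notrunc keeps a regular file at its original size; it makes
    # no difference for a block device.
    dd if=/dev/zero of="$target" bs=512 count=1 conv=notrunc 2>/dev/null
}

# Example use (as root):
#   wipe_first_sector /dev/sda
#   wipe_first_sector /dev/sdb
```

Only sector 0 is cleared, so software RAID and LVM metadata stored further into the disk survives. That may be why the installer warned about the volume group name already being in use.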

Next to try is adding @faidon's old workaround (which Debian upstream claims is no longer necessary as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666974 is closed).

Change 488112 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] dualboot: grub-installer/only_debian --> false

https://gerrit.wikimedia.org/r/488112

Change 488112 merged by CDanis:
[operations/puppet@production] dualboot: grub-installer/only_debian --> false

https://gerrit.wikimedia.org/r/488112

  1. Overwrote first 512 bytes of /dev/sda and /dev/sdb with zeros
    • Boot automatically fell through disk to using PXE
  2. Reimaged system again, with the change for the only_debian setting
  3. BIOS boot order port A first
    • booted fine
  4. Swapped boot order: port B
    • booted fine!

Grub does seem to be installed on both MBRs, with just minimal differences (going to guess pointing at the different physical disk for /):

root@theemin:~# diff -u <(dd status=none if=/dev/sda bs=512 count=1 | od -A x -t x1z -v) <(dd status=none if=/dev/sdb bs=512 count=1 | od -A x -t x1z -v)
--- /dev/fd/63	2019-02-05 19:58:35.256147601 +0000
+++ /dev/fd/62	2019-02-05 19:58:35.256147601 +0000
@@ -25,7 +25,7 @@
 000180 7d e8 2e 00 cd 18 eb fe 47 52 55 42 20 00 47 65  >}.......GRUB .Ge<
 000190 6f 6d 00 48 61 72 64 20 44 69 73 6b 00 52 65 61  >om.Hard Disk.Rea<
 0001a0 64 00 20 45 72 72 6f 72 0d 0a 00 bb 01 00 b4 0e  >d. Error........<
-0001b0 cd 10 ac 3c 00 75 f4 c3 d0 62 80 19 00 00 00 20  >...<.u...b..... <
+0001b0 cd 10 ac 3c 00 75 f4 c3 18 81 77 c5 00 00 80 20  >...<.u....w.... <
 0001c0 21 00 fd fe ff ff 00 08 00 00 00 18 d2 05 00 fe  >!...............<
 0001d0 ff ff fd fe ff ff 00 20 d2 05 00 d0 1d 00 00 fe  >....... ........<
 0001e0 ff ff fd fe ff ff 00 f0 ef 05 00 50 02 16 00 00  >...........P....<
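
To read the differing bytes more easily, the sketch below decodes two fields, assuming the conventional MBR layout: a 32-bit little-endian disk signature at offset 0x1b8 and the status byte of the first partition entry at offset 0x1be (0x80 marks the partition active). The function name `mbr_summary` is hypothetical, and this is an interpretation aid rather than part of the original testing.

```shell
# Hypothetical helper: print the MBR disk signature and the status byte
# of the first partition entry for a device or image file.
mbr_summary() {
    dev="$1"
    # Bytes 0x1b8..0x1bb (decimal 440..443): disk signature, little-endian.
    set -- $(od -A n -t x1 -j 440 -N 4 "$dev")
    sig="$4$3$2$1"
    # Byte 0x1be (decimal 446): status of partition entry 1.
    status=$(od -A n -t x1 -j 446 -N 1 "$dev" | tr -d ' \n')
    printf 'signature=0x%s part1_status=0x%s\n' "$sig" "$status"
}

# Example use (as root):
#   mbr_summary /dev/sda
#   mbr_summary /dev/sdb
```

Under that layout, the bytes in the diff above would decode to signature 0x198062d0 with status 0x00 on sda, and signature 0xc5778118 with status 0x80 on sdb. In other words, the two sectors appear to differ in the per-disk signature and in the active flag of the first partition, while the boot code shown in the surrounding context lines is identical.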

@Papaul when you're back in codfw and have a spare moment, can you please unplug the disk on SATA port A from wmf6653 (in row D), and attempt booting it?
The disk on port B should also have a valid boot record on it.

Assuming the server fails to boot -- not sure if it will or not! -- then could you go into the BIOS and change: System BIOS Settings > Boot Settings > Hard-Disk Failover == True? And then try booting it again. Thanks!

(Continuing the log from above: after verifying that sdb / port B seems bootable, I've reset the BIOS boot order to the default of port A first, and powered off the machine.)

  • Removed sda from the server
  • Booted the server
  • Server booted without a problem

Change 576840 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use buster for theemin

https://gerrit.wikimedia.org/r/576840

Change 576840 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use buster for theemin

https://gerrit.wikimedia.org/r/576840

Aklapper subscribed.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

@CDanis,

Is testing complete with this host? If so, should we reclaim to spares and/or decom it?

These days we have sretest*, so should be good to reclaim.

wiki_willy added a project: ops-codfw.
wiki_willy subscribed.

Looks like this one fell through the cracks without the "ops-codfw" project tag, so adding it back in. cc @Papaul

cookbooks.sre.hosts.decommission executed by pt1979@cumin2002 for hosts: theemin.codfw.wmnet

  • theemin.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc.)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

The only thing left on this task is to unrack the server and remove all the disks.

Papaul updated the task description.

Complete