setup/install krb2001/WMF6577
Closed, ResolvedPublic

Description

This task will track the setup and installation of WMF6577 as a Kerberos node in codfw.

Please note this task will initially be assigned to @elukey for feedback on several variables required for the installation.

Hostnames: krb2001
Networking/Subnet/VLAN/IP: Internal subnet
Partitioning/Raid: standard raid1 sw setup

krb2001/WMF6577:

  • T233962 created to apply hostname labels
  • WMF6577 set from 'inventory' to 'planned' as it is allocated for use and no longer a spare system.
  • switch port updated (description and vlan)
  • netbox entry updated (hostname)
  • dns update
  • puppet repo update
  • OS installation
  • initial puppet run
  • netbox entry status changed to 'staged'
  • handoff to service implementation team
  • sub-team which implements the service must change netbox status to 'active'

Event Timeline

@elukey,

Please note that both T233141 (eqiad) and T233142 (codfw) are nearly identical.

We need the following info to set up these hosts:

  • Hostnames?
  • Internal or External subnet?
  • Is the standard raid1 of the SSDs (with /srv in its own LVM) acceptable for partitioning?

Please address the above and assign back to me for followup.

Thanks!

Thanks a lot!

Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech)
Internal subnet, no Analytics VLAN
raid1 is good enough

@RobH any chance to get this done by today/tomorrow? Really sorry to press you, but it would help a lot in trying to make a quarterly goal. If you are busy, no problem!

RobH renamed this task from setup/install codfw kerbos node WMF6577 to setup/install krb2001/WMF6577. Sep 26 2019, 3:53 PM
RobH triaged this task as High priority.
RobH updated the task description.
RobH added a subscriber: Papaul.

Unfortunately, it appears the switch port for this system is not labeled on asw-d8-codfw, so we'll need @Papaul to trace it out and update this task with the port info.

@Papaul: Please update this task with the port for this system (it is located in D8-codfw) so we can install it, thanks! You can assign back to me after comment.

Change 539358 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] updating krb2001 mgmt and prod dns

https://gerrit.wikimedia.org/r/539358

Change 539358 merged by RobH:
[operations/dns@master] updating krb2001 mgmt and prod dns

https://gerrit.wikimedia.org/r/539358

RobH updated the task description.

Change 539363 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] krb2001 install params

https://gerrit.wikimedia.org/r/539363

Change 539363 merged by RobH:
[operations/puppet@production] krb2001 install params

https://gerrit.wikimedia.org/r/539363

RobH removed a project: Patch-For-Review.
RobH updated the task description.
ge-8/0/3        down  down krb2001

Everything is ready for this to install, but it doesn't see any network attachment on its primary interface when trying to PXE boot.

@Papaul: Please troubleshoot this device's production network connection; it is down and won't PXE boot. It may be a loose connection, wrong port info, or a bad cable.

The port was disabled and I missed it; Papaul pointed it out. Fixed.

grub is failing to install on sda, regardless of distro.

I think this may be a hardware failure. Perhaps we should swap the disks around and see if the error follows the disk.

RobH reassigned this task from RobH to Papaul. Edited Sep 26 2019, 6:54 PM

@Papaul,

Since this is failing grub on sda, I'd like to see if it is a disk issue (most likely), a bay issue (moderately likely if the backplane is bad), or a software issue (possible).

If you would swap sda and sdb around (these are hotswap disks), and assign back to me, I can troubleshoot further.

Right now, the installer fails after partitioning, during grub install.

Very strange: the Debian installer succeeds in setting up the raids and LVM volumes, but fails when installing grub. I noticed that the host has multiple huge disks (4TB each), and they all get a GPT partition table (checked via the d-i shell's parted_devices). Could it be that grub is not working with these huge disk partitions/sizes?

That should not be an issue with Grub 2 in general. Did we capture the full error message of grub-install? It might also be worth updating the server's firmware if it's old.
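
As a side note on the GPT question: one known requirement (not confirmed as the cause here) is that Grub 2 on a BIOS-booted machine needs a BIOS boot partition on a GPT disk to embed its core image. A minimal sketch of checking for it, assuming the input is the machine-readable output of `parted -m <device> print`; the sample text below is made up for illustration and is not taken from krb2001:

```python
def has_bios_boot_partition(parted_machine_output: str) -> bool:
    """Return True if any partition line carries the bios_grub flag.

    Expects `parted -m` style output, where each partition line looks like
    number:start:end:size:filesystem:name:flags;
    """
    for line in parted_machine_output.splitlines():
        fields = line.strip().rstrip(";").split(":")
        # Partition lines start with a partition number; skip the
        # "BYT;" header and the device summary line.
        if len(fields) >= 7 and fields[0].isdigit():
            flags = [flag.strip() for flag in fields[-1].split(",")]
            if "bios_grub" in flags:
                return True
    return False


# Hypothetical sample, for illustration only (not captured from the host).
SAMPLE_WITH_BIOS_BOOT = """BYT;
/dev/sda:4001GB:scsi:512:4096:gpt:ATA EXAMPLE:;
1:1049kB:2097kB:1049kB:::bios_grub;
2:2097kB:100GB:100GB:::raid;
3:100GB:4001GB:3901GB:::raid;
"""

SAMPLE_WITHOUT_BIOS_BOOT = """BYT;
/dev/sda:4001GB:scsi:512:4096:gpt:ATA EXAMPLE:;
1:1049kB:100GB:100GB:::raid;
2:100GB:4001GB:3901GB:::raid;
"""

print(has_bios_boot_partition(SAMPLE_WITH_BIOS_BOOT))
print(has_bios_boot_partition(SAMPLE_WITHOUT_BIOS_BOOT))
```

Whether this host boots via BIOS or UEFI is not stated in this task, so treat the check as one thing to rule out rather than a diagnosis.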

I am trying to see whether switching the partman recipe entry to krb2001) echo partman/raid10-gpt-srv-lvm-ext4.cfg ;; \ makes any difference, but I don't know exactly where to find the grub logs. Are they somewhere in syslog in the d-i shell?
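
For readers unfamiliar with the recipe line quoted above: it is one arm of a shell case statement that maps a hostname to a partman recipe file. A rough Python sketch of the same lookup, assuming glob-style hostname patterns; the table holds only the entry mentioned in this task, and the function name and the no-match behaviour are hypothetical:

```python
from fnmatch import fnmatch
from typing import Optional

# Hostname pattern -> partman recipe. Only the krb2001 entry comes from
# this task; a real table would carry many more patterns.
RECIPES = [
    ("krb2001", "partman/raid10-gpt-srv-lvm-ext4.cfg"),
]


def recipe_for(hostname: str) -> Optional[str]:
    """Return the first recipe whose pattern matches, like a shell case."""
    for pattern, recipe in RECIPES:
        if fnmatch(hostname, pattern):
            return recipe
    return None  # assumption: no default recipe is modelled here


print(recipe_for("krb2001"))
```

As in a shell case statement, the first matching pattern wins, so more specific patterns would need to come before broader ones.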

The new recipe seems to have worked!

Change 539518 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Change partman recipe for krb2001

https://gerrit.wikimedia.org/r/539518

Change 539518 merged by Elukey:
[operations/puppet@production] Change partman recipe for krb2001

https://gerrit.wikimedia.org/r/539518

elukey@krb2001:~$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                           32G     0   32G   0% /dev
tmpfs                         6.3G  9.9M  6.3G   1% /run
/dev/md0                       92G  1.6G   85G   2% /
tmpfs                          32G     0   32G   0% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
tmpfs                          32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/krb2001--vg-data  5.8T   89M  5.5T   1% /srv
tmpfs                         6.3G     0  6.3G   0% /run/user/13926
tmpfs                         6.3G     0  6.3G   0% /run/user/0

elukey@krb2001:~$ cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid10 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      7716110336 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [>....................]  resync =  3.2% (253352064/7716110336) finish=618.8min speed=200991K/sec
      bitmap: 57/58 pages [228KB], 65536KB chunk

md0 : active raid10 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      97589248 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
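
The finish estimate in the resync line can be reproduced from the other numbers on that line: remaining blocks divided by the reported speed. A small sketch, assuming the progress counters and the speed are both in 1 KiB units; the function name is made up for illustration:

```python
import re

# The resync progress line exactly as shown in the output above.
RESYNC_LINE = (
    "      [>....................]  resync =  3.2% "
    "(253352064/7716110336) finish=618.8min speed=200991K/sec"
)


def resync_eta_minutes(line: str) -> float:
    """Estimate minutes left from the (done/total) counters and the speed."""
    match = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    if match is None:
        raise ValueError("not a resync progress line")
    done, total, speed = (int(value) for value in match.groups())
    return (total - done) / speed / 60


print(f"{resync_eta_minutes(RESYNC_LINE):.1f} min")
```

For the line above this gives about 618.8 minutes, matching the kernel's own finish= value, so the initial sync of md1 takes a bit over ten hours at that speed.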

5.5T of /srv in raid10 is not bad :D
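
For reference, the sizes can be cross-checked against the mdstat output. A quick sketch, assuming mdstat block counts are in 1 KiB units and that each raid10 array uses 4 members with 2 near-copies as shown; the helper name is hypothetical:

```python
KIB_PER_GIB = 1024 ** 2
KIB_PER_TIB = 1024 ** 3

MD0_KIB = 97589248      # md0 block count from /proc/mdstat
MD1_KIB = 7716110336    # md1 block count from /proc/mdstat


def raid10_member_kib(array_kib: int, devices: int, copies: int) -> int:
    """Per-member size implied by a raid10 array's usable size.

    Usable size = devices * member_size / copies, solved for member_size.
    """
    return array_kib * copies // devices


print(f"md0 usable: {MD0_KIB / KIB_PER_GIB:.1f} GiB")
print(f"md1 usable: {MD1_KIB / KIB_PER_TIB:.2f} TiB")
member = raid10_member_kib(MD1_KIB, devices=4, copies=2)
print(f"md1 per-member partition: {member / KIB_PER_TIB:.2f} TiB")
```

md0 comes out at roughly 93 GiB, in line with the 92G that df shows for /. md1 is about 7.19 TiB, which is more than the 5.8T df reports for /srv, so the data volume apparently does not use the whole array.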

Change 539532 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add AAAA/PTR IPv6 records for kerb2001.codfw.wmnet

https://gerrit.wikimedia.org/r/539532

elukey updated the task description.

Change 539532 merged by Elukey:
[operations/dns@master] Add AAAA/PTR IPv6 records for kerb2001.codfw.wmnet

https://gerrit.wikimedia.org/r/539532
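
For anyone repeating the IPv6 step on another host, the PTR name is the address's 32 nibbles reversed under ip6.arpa, and Python's standard library can derive it. A minimal sketch; the address below is from the documentation-only 2001:db8::/32 range and is not krb2001's real address, and the FQDN krb2001.codfw.wmnet is assumed from the hostname and site named in this task:

```python
import ipaddress

# Documentation-range address, NOT the host's real one.
EXAMPLE_ADDR = "2001:db8::1"
# Assumed FQDN, built from the hostname and site named in this task.
EXAMPLE_FQDN = "krb2001.codfw.wmnet"


def aaaa_and_ptr(fqdn: str, addr: str) -> tuple:
    """Build matching forward (AAAA) and reverse (PTR) record strings."""
    ip = ipaddress.ip_address(addr)
    forward = f"{fqdn}. IN AAAA {ip.compressed}"
    reverse = f"{ip.reverse_pointer}. IN PTR {fqdn}."
    return forward, reverse


for record in aaaa_and_ptr(EXAMPLE_FQDN, EXAMPLE_ADDR):
    print(record)
```

The record strings are in generic zone-file style and may not match the layout used in the operations/dns repository.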