(Need By: TBD) rack/setup/install ms-be20[58-61]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of ms-be20[58-61].

These will replace hosts ms-be[2016-2027] and expand ms-be capacity.

Hostname / Racking / Installation Details

Hostnames: ms-be20[58-61]
Racking Proposal: One host per row. Better not to share racks with existing ms-be hosts (that we are not refreshing), but if we can't avoid it that's fine too.
Networking/Subnet/VLAN/IP: 10G private vlan
Partitioning/Raid: same partman recipe as the existing ms-be hosts
OS Distro: Stretch
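
The bracketed range notation used for the hostnames above can be expanded mechanically. A minimal sketch, assuming a single `[start-end]` numeric range per pattern; `expand` is an illustrative helper name, not part of any existing tooling:

```python
import re


def expand(pattern: str) -> list[str]:
    """Expand one numeric [start-end] range, e.g. ms-be20[58-61]."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if m is None:
        # No range in the pattern: return it unchanged.
        return [pattern]
    prefix, start, end, suffix = m.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


print(expand("ms-be20[58-61]"))
# ['ms-be2058', 'ms-be2059', 'ms-be2060', 'ms-be2061']
```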

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-be2058: Row C rack C4 U1/U2 xe-4/0/0

  • - receive in system on procurement task T264139 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ms-be2059: Row D rack D7 U3/U4 xe-7/0/1

  • - receive in system on procurement task T264139 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ms-be2060: Row A rack A4 U16/U17 xe-4/0/25

  • - receive in system on procurement task T264139 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ms-be2061: Row D rack D2 U2/U3 xe-2/0/1

  • - receive in system on procurement task T264139 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH added a parent task: Unknown Object (Task). Oct 13 2020, 10:05 PM
RobH removed a subscriber: RobH.

Change 643320 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP Add MAC address for ms-be205[8-9] and ms-be206[0-1]

https://gerrit.wikimedia.org/r/643320

Change 643320 merged by Papaul:
[operations/puppet@production] DHCP Add MAC address for ms-be205[8-9] and ms-be206[0-1]

https://gerrit.wikimedia.org/r/643320

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2059.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011241956_pt1979_27659_ms-be2059_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2059.codfw.wmnet']

Of which those FAILED:

['ms-be2059.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2059.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011242017_pt1979_32176_ms-be2059_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2059.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2059.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011302001_pt1979_1719_ms-be2059_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2059.codfw.wmnet']

Of which those FAILED:

['ms-be2059.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2059.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011302027_pt1979_6775_ms-be2059_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2058.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012010032_pt1979_18452_ms-be2058_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2058.codfw.wmnet']

Of which those FAILED:

['ms-be2058.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2058.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012010033_pt1979_18501_ms-be2058_codfw_wmnet.log.

@fgiunchedi I re-imaged ms-be2059 and ms-be2058, puppet is not happy

WARNING: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1]

Can you please take a look and see why puppet is not happy so I can resume the install on the other 2 hosts?

Thanks

Completed auto-reimage of hosts:

['ms-be2058.codfw.wmnet']

Of which those FAILED:

['ms-be2058.codfw.wmnet']

So two different failure modes AFAICS:

  • ms-be2058 for some reason has got its sdc and sdb swapped post debian-installer:
ms-be2058:~$ cat /proc/partitions 
major minor  #blocks  name

   8        0  468320256 sda
   8        1   58592256 sda1
   8        2     976896 sda2
   8        3   97655808 sda3
   8        4  311093248 sda4
   8       32  468320256 sdc
   8       33   58592256 sdc1
   8       34     976896 sdc2
   8       35   97655808 sdc3
   8       36  311093248 sdc4
...
   8       16 7813464064 sdb

So the installer completed correctly with sda/sdb being the ssd, but then on reboot the disks were renumbered and puppet is confused by this fact. I tested a reboot of ms-be2058 to see if it'd restore the disk order, but the host isn't coming back.
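
This kind of renumbering can be spotted mechanically from the /proc/partitions output. A minimal sketch, assuming that the SSDs are the smallest whole disks on the host (true for the sizes shown here, but an assumption in general) and that whole-disk names carry no trailing digit (which holds for sdX names but not for NVMe devices). `whole_disks` and `SAMPLE` are illustrative names, not part of any existing tooling:

```python
def whole_disks(proc_partitions: str) -> dict[str, int]:
    """Map whole-disk names to their size in 1 KiB blocks."""
    disks = {}
    for line in proc_partitions.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything that is not a data row.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        _major, _minor, blocks, name = fields
        # Assumption: partitions end in a digit (sda1), whole disks do not (sda).
        if not name[-1].isdigit():
            disks[name] = int(blocks)
    return disks


# Abbreviated copy of the ms-be2058 output quoted in this comment.
SAMPLE = """\
major minor  #blocks  name

   8        0  468320256 sda
   8        1   58592256 sda1
   8       32  468320256 sdc
   8       33   58592256 sdc1
   8       16 7813464064 sdb
"""

disks = whole_disks(SAMPLE)
smallest = min(disks.values())
# Assumption: the SSDs are the disks that share the smallest size.
ssds = sorted(name for name, size in disks.items() if size == smallest)
print(ssds)  # ['sda', 'sdc'] rather than the expected ['sda', 'sdb']
```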

  • ms-be2059: it looks like there was an existing vfat filesystem on sdc1, and that is what confused puppet. I wiped the filesystem and now puppet is happy again:
ms-be2059:~$ sudo wipefs -a /dev/sdc1
/dev/sdc1: 8 bytes were erased at offset 0x00000052 (vfat): 46 41 54 33 32 20 20 20
/dev/sdc1: 1 byte was erased at offset 0x00000000 (vfat): eb
/dev/sdc1: 2 bytes were erased at offset 0x000001fe (vfat): 55 aa
ms-be2059:~$ sudo puppet agent --test --verbose
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for ms-be2059.codfw.wmnet
Info: Applying configuration version '(0254fe8eb7) Elukey - hadoop: Migrate hiera() to lookup() and setting datatype in spark2'
Notice: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdc]/Exec[mkfs-/dev/sdc1]/returns: executed successfully
Notice: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdc]/Swift::Mount_filesystem[/dev/sdc1]/Exec[mountpoint-root-/srv/swift-storage/sdc1]/returns: executed successfully
Notice: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdc]/Swift::Mount_filesystem[/dev/sdc1]/Mount[/srv/swift-storage/sdc1]/ensure: defined 'ensure' as 'defined'
Info: Computing checksum on file /etc/fstab
Info: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdc]/Swift::Mount_filesystem[/dev/sdc1]/Mount[/srv/swift-storage/sdc1]: Scheduling refresh of Mount[/srv/swift-storage/sdc1]
Notice: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdc]/Swift::Mount_filesystem[/dev/sdc1]/Mount[/srv/swift-storage/sdc1]: Triggered 'refresh' from 1 event
Info: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdc]/Swift::Mount_filesystem[/dev/sdc1]/Mount[/srv/swift-storage/sdc1]: Scheduling refresh of Mount[/srv/swift-storage/sdc1]
Notice: Applied catalog in 19.97 seconds
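
The byte ranges wipefs reported erasing are enough to recognise this kind of leftover signature ahead of time. A minimal sketch, assuming that checking only the FAT32 type string at offset 0x52 and the 55 aa marker at offset 0x1fe is sufficient for illustration; real detection (libblkid, which wipefs uses) performs more checks. `looks_like_fat32` and `read_first_sector` are hypothetical helper names:

```python
# Offsets and byte values taken from the wipefs output above.
FAT32_TYPE = (0x52, b"FAT32   ")
BOOT_MARKER = (0x1FE, b"\x55\xaa")


def looks_like_fat32(sector: bytes) -> bool:
    """Return True if the sector carries both FAT32 markers."""
    return all(
        sector[offset:offset + len(magic)] == magic
        for offset, magic in (FAT32_TYPE, BOOT_MARKER)
    )


def read_first_sector(path: str, size: int = 512) -> bytes:
    """Read the first bytes of a device or image file (needs read access)."""
    with open(path, "rb") as f:
        return f.read(size)


# Synthetic sector standing in for a partition with a stale signature.
stale = bytearray(512)
stale[0x00] = 0xEB
stale[0x52:0x5A] = b"FAT32   "
stale[0x1FE:0x200] = b"\x55\xaa"

print(looks_like_fat32(bytes(stale)))  # True
print(looks_like_fat32(bytes(512)))    # False
```

On a real host this would be called as `looks_like_fat32(read_first_sector("/dev/sdc1"))`, which requires root privileges to read the device.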

Hope that helps!

@fgiunchedi thanks. ms-be2058 has a memory error: the same DIMM we were having problems with on ms-be2057 was used in ms-be2058, so I will go ahead and ask for a replacement. The server is back up, though.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011422_pt1979_9982_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011433_pt1979_11336_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011506_pt1979_16777_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011819_pt1979_22321_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011915_pt1979_32688_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011930_pt1979_2357_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2061.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012012121_pt1979_23205_ms-be2061_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2061.codfw.wmnet']

Of which those FAILED:

['ms-be2061.codfw.wmnet']

Change 644803 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] Add new ms-be nodes to site.pp

https://gerrit.wikimedia.org/r/644803

Change 644803 merged by Papaul:
[operations/puppet@production] Add new ms-be nodes to site.pp

https://gerrit.wikimedia.org/r/644803

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012021401_pt1979_12323_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

Of which those FAILED:

['ms-be2060.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012021401_pt1979_12379_ms-be2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2060.codfw.wmnet']

and were ALL successful.

Change 644847 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: assign roles for all ms-be / ms-fe hosts

https://gerrit.wikimedia.org/r/644847

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2061.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012021539_pt1979_4680_ms-be2061_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2061.codfw.wmnet']

Of which those FAILED:

['ms-be2061.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ms-be2061.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012021606_pt1979_12228_ms-be2061_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ms-be2061.codfw.wmnet']

and were ALL successful.

Change 644847 merged by Filippo Giunchedi:
[operations/puppet@production] site: assign roles for all ms-be / ms-fe hosts

https://gerrit.wikimedia.org/r/644847