
replace ulsfo aging servers
Closed, Resolved · Public

Description

This task will serve as the overall tracking and planning task to replace the systems that have aged out of warranty @ ulsfo.

The systems have been ordered on procurement tasks T160934 & T160936.

There isn't enough power overhead to cable up the new systems and keep the old ones online. We'll likely need to take ulsfo offline for a number of days, if possible.

CP systems replacement:

  • receive in new systems cp4021-cp4032 from T160934, verify hardware is correct. This step will likely require the systems to be racked in the top of the racks, as offlining ulsfo on 2017-05-03 is non-ideal.
  • offline ulsfo, decommission existing cp systems
  • wipe disks on cp4001-cp4020
  • decom/wipe of systems will take place in sub-tasks:
      • T169020 Decommission cp400[1-4]
      • T167377 Decommission cp4011, cp4012, cp4019, cp4020
  • unrack systems after wipe, arrange for sale to IT refurbishment; this may result in storage fees with UnitedLayer
  • place new cp systems in old cp system spots, reuse all power and data cabling. ALTERNATIVE: replace the fiber optic runs in the rack with DAC cables.
  • setup/install new cp4021-cp4032

Related Objects

Event Timeline

Yeah, we should discuss our options a bit here re: minimizing ulsfo downtime; I think we have a few options for how we arrange this. There are some complicating factors with the misc and maps clusters: the new hardware config assumes we've done the software work to fold those into the primary clusters. The reason we didn't block on that work is that the backup plan is to simply not have misc and maps endpoints in ulsfo until the software side is done.

Change 351659 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] r::c::base - support 800G disks for new ulsfo systems

https://gerrit.wikimedia.org/r/351659

IRC update:

Per @BBlack's request, we'll split the numbering of odd and even hosts across the odd and even racks: rack 1.23 gets cp4021, rack 1.22 gets cp4022, and so on.

rack   hostname
1.23   cp4021
1.22   cp4022
1.23   cp4023
1.22   cp4024
1.23   cp4025
1.22   cp4026
1.23   cp4027
1.22   cp4028
1.23   cp4029
1.22   cp4030
1.23   cp4031
1.22   cp4032

This keeps the ranges in sequence when they're split into service groups. Example: cp4021-4024 are text (split evenly between both racks).
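To make the layout concrete, here's a minimal Python sketch (illustration only, not our actual provisioning tooling; the host range and rack names are simply taken from the table above) showing the odd/even assignment and how each contiguous group of four hosts lands half in each rack:

  # Hypothetical illustration of the odd/even rack split described above:
  # odd-numbered hosts go to rack 1.23, even-numbered hosts to rack 1.22.

  def rack_for(host_number: int) -> str:
      """Rack for a given cp host number under the assumed odd/even convention."""
      return "1.23" if host_number % 2 else "1.22"

  def service_groups(start=4021, end=4032, group_size=4):
      """Yield each contiguous service group and its per-rack host counts."""
      for first in range(start, end + 1, group_size):
          hosts = [f"cp{n}" for n in range(first, first + group_size)]
          racks = [rack_for(n) for n in range(first, first + group_size)]
          yield hosts, {r: racks.count(r) for r in sorted(set(racks))}

  if __name__ == "__main__":
      for hosts, counts in service_groups():
          # e.g. ['cp4021', 'cp4022', 'cp4023', 'cp4024'] {'1.22': 2, '1.23': 2}
          print(hosts, counts)

Each group of four (e.g. the text hosts cp4021-4024) ends up with two hosts per rack, so the in-sequence ranges also stay balanced across the two racks.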

Change 351711 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] setting up dns for cp4021-cp4032

https://gerrit.wikimedia.org/r/351711

Change 351711 merged by RobH:
[operations/dns@master] setting up dns for cp4021-cp4032

https://gerrit.wikimedia.org/r/351711

cp4021-cp4032 have been racked, but ONLY cp4021 is connected to mgmt and the network. There isn't enough power overhead in the racks to wire up all of the new systems.

cp4021 is only wired up for testing and provisioning purposes, not to go fully live, since it won't stay where it is in the rack once the other systems are decommissioned.

Change 351740 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] add cp4021 dhcp and temporary site.pp entry

https://gerrit.wikimedia.org/r/351740

Change 351740 merged by BBlack:
[operations/puppet@production] add cp4021 dhcp and temporary site.pp entry

https://gerrit.wikimedia.org/r/351740

Change 351659 merged by BBlack:
[operations/puppet@production] r::c::base - support 800G disks for new ulsfo systems

https://gerrit.wikimedia.org/r/351659

Change 351742 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp4021: use macaddr from second 10Gbps interface

https://gerrit.wikimedia.org/r/351742

Change 351742 merged by BBlack:
[operations/puppet@production] cp4021: use macaddr from second 10Gbps interface

https://gerrit.wikimedia.org/r/351742

ok I'm installing jessie onto cp4021 now (just to test configuration issues and patch up puppet for the real installs later!). Things I found while trying to boot:

  1. The host has two separate Ctrl-S prompts as it boots, for ethernet card setup. The first says Broadcom and covers the embedded ports; the second says QLogic and covers the 10G ports.
  2. The QLogic ports both had PXE disabled (I enabled it on both).
  3. The cable is in what the QLogic setup considers the second port (so I updated our DHCP config to use that for now).
  4. I went ahead and disabled the onboard Broadcom 1G ports in the BIOS, using the "Disabled (OS)" setting; we'll see what the runtime effect is under Linux (they still show boot menus and try to PXE, of course).

With the cable in the second port plus T164444, we're blocked on getting a successful test install. @RobH is asking smart hands to swap the cable, and I'll proceed once one of those two issues is resolved.

Done by smart hands.

Change 356605 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] LVS: new redundancy layout for new eqiad ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 356605 merged by BBlack:
[operations/puppet@production] LVS: new redundancy layout for new eqiad ulsfo hosts

https://gerrit.wikimedia.org/r/356605

Change 361777 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] cache_misc: take ulsfo IPs out of effective service

https://gerrit.wikimedia.org/r/361777

RobH changed the status of subtask T171967: setup/install cp4022 from Stalled to Open. Aug 1 2017, 4:46 PM

cp402[1-8] are all racked and ready for use.

Excellent news! I'll try to squeeze in replacing one of the clusters ASAP, which will decom another six of the old cp hosts and let us move further.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo.wmnet', 'cp4024.ulsfo.wmnet', 'cp4025.ulsfo.wmnet', 'cp4026.ulsfo.wmnet', 'cp4027.ulsfo.wmnet', 'cp4028.ulsfo.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708231516_bblack_29684.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo.wmnet', 'cp4024.ulsfo.wmnet', 'cp4025.ulsfo.wmnet', 'cp4026.ulsfo.wmnet', 'cp4027.ulsfo.wmnet', 'cp4028.ulsfo.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708231533_bblack_3331.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo.wmnet', 'cp4024.ulsfo.wmnet', 'cp4025.ulsfo.wmnet', 'cp4026.ulsfo.wmnet', 'cp4027.ulsfo.wmnet', 'cp4028.ulsfo.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708301648_bblack_28334.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp4022.ulsfo.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708301749_bblack_15395.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp4022.ulsfo.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708311338_bblack_31014.log.

Completed auto-reimage of hosts:

['cp4022.ulsfo.wmnet']

and were ALL successful.

Recapping where we're at on all things here, because even I get lost sometimes:

Of the old hosts being decommed, the only ones still in live use are:

  • bast4001 (blocking on bast4002 going into service)
  • lvs400[1234] (blocking on comfort level with lvs400[567], which are already in service)

Of the new hosts, everything but bast4002 and dns400[12] is already in active service. Those remaining hosts are installed as stretch with the spare::system role for now, waiting for someone to work out various software details and bring them into service.

All new systems are in place, resolving this task.