rack/setup/install cp1075-cp1090
Closed, Resolved, Public

Description

This task will track the racking, setup, and installation of 16 new cp systems ordered via T193911. Please note that these will need a few things done differently from past cp systems; mainly, their partitioning is going to differ.

Racking Proposal: Rack the 16 systems 4 per row, in 10G racks, spaced as evenly as possible across the 10G racks in each row. No consideration needs to be given to any of the existing cp systems, as these will replace ALL of them except cp1008 (test host) and cp1071-1074 (these will eventually go away, but only well after the new systems are online and all other cp system decoms are done). Ideally, place 2 per rack, so that each rack holds one text host and one upload host. (This would keep things simpler for traffic, if possible.)
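
The proposal above can be sketched as a small assignment routine. This is only an illustration of the "4 per row, 2 per rack, one text and one upload" idea: the row names A-D, the per-row rack indices, and the choice of which host in a pair is text or upload are assumptions, not part of the actual racking plan.

```python
def racking_plan(hosts, rows, hosts_per_rack=2):
    """Spread hosts evenly over rows, pairing them up per rack.

    Rack numbers are hypothetical per-row indices (1, 2, ...), not real
    rack names. The text/upload alternation is an assumed convention.
    """
    per_row = len(hosts) // len(rows)
    plan = []
    for i, host in enumerate(hosts):
        row = rows[i // per_row]            # fill one row before the next
        slot = i % per_row                  # position within the row
        rack = slot // hosts_per_rack + 1   # two consecutive hosts share a rack
        cluster = "text" if slot % 2 == 0 else "upload"
        plan.append((host, row, rack, cluster))
    return plan


hosts = [f"cp{n}" for n in range(1075, 1091)]   # cp1075 .. cp1090
plan = racking_plan(hosts, rows=["A", "B", "C", "D"])
for entry in plan:
    print(entry)
```

With 16 hosts and 4 rows this yields 4 hosts per row in 2 racks per row, each rack holding one text and one upload host.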

old cp decom note: NONE of the existing hosts will be decommissioned until after these new hosts are fully online and deployed.

cp1075-1090:

  • receive in system on procurement task T193911
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    • end on-site specific steps
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible). PLEASE NOTE: A new partman recipe will need to be written for this system. Feel free to escalate to @RobH for this part.
  • OS installation
  • puppet accept/initial run
  • handoff for service implementation

Event Timeline

RobH renamed this task from rack/setup/install cp1075-cp1091 to rack/setup/install cp1075-cp1090. May 29 2018, 10:04 PM
RobH updated the task description.
RobH updated the task description.
BBlack mentioned this in Unknown Object (Task). Jun 26 2018, 4:58 PM
Vvjjkkii renamed this task from rack/setup/install cp1075-cp1090 to 00baaaaaaa. Jul 1 2018, 1:07 AM
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: Cmjohnson; removed: Aklapper.
CommunityTechBot renamed this task from 00baaaaaaa to rack/setup/install cp1075-cp1090. Jul 2 2018, 2:34 AM
CommunityTechBot assigned this task to Cmjohnson.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot edited subscribers, added: Aklapper; removed: Cmjohnson.

Change 446619 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cp10[75-90]

https://gerrit.wikimedia.org/r/446619

Change 446619 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cp10[75-90]

https://gerrit.wikimedia.org/r/446619
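
For reference, the bracketed range in the change title (cp10[75-90]) covers all 16 hosts in this task. A minimal sketch of expanding that notation follows; `expand_range` is a hypothetical helper written for illustration, not a tool from the dns repository.

```python
import re


def expand_range(pattern):
    """Expand a single 'prefix[start-end]suffix' range into a list of names.

    Patterns without a bracketed numeric range are returned unchanged.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if m is None:
        return [pattern]
    prefix, start, end, suffix = m.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


names = expand_range("cp10[75-90]")
print(len(names), names[0], names[-1])
```

A domain suffix can be included in the pattern, for example `expand_range("cp10[75-90].eqiad.wmnet")`.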

@ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add them to the public vlan? Also, adding the servers to the vlan in the other rows did not automatically enable the ports. Can you check that as well, please?

xe-4/0/11 cp1075
xe-4/0/13 cp1076
xe-7/0/30 cp1077
xe-7/0/31 cp1078

@Cmjohnson Public vlan? I highly doubt these should go in the public vlan; the older systems are in private too...

Thanks @mark, fixing now. I looked up one other host and it must've been for something else. I believe it was cp1008.

Change 447816 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Add ipv4/ipv6 dns cp1075-1090

https://gerrit.wikimedia.org/r/447816

Cmjohnson added a subscriber: Cmjohnson.

@RobH can you take over the installs from here? I did the production dns; please review and merge if okay. I am not seeing a physical link with the switch. I disabled the ethernet ports and made sure the 10G ports are PXE-ready. The BIOS sees the Broadcom 10G port in its boot order. @ayounsi confirmed the switch looked okay.

Thanks

I'll take over these from here. It's a very new hardware config, and we'll have to develop some puppet-level fixups for it as we test how the installation and runtime work out.

Change 447816 merged by BBlack:
[operations/dns@master] Add ipv4/ipv6 dns cp1075-1090

https://gerrit.wikimedia.org/r/447816

What I know so far from testing on cp1075:

  • The various BIOS settings seem fine so far; I didn't have to change anything in the BIOS, NIC, or controller firmware setups
  • The jessie installer's 3.16 kernel lacks at least the bnxt_en driver needed to bring up the NIC, and is possibly also missing some storage drivers (e.g. for the nvme). So for now cp1075 is booted into a base (unpuppeted) install of stretch for exploration. Probably everything would be fine on our much-newer current jessie runtime kernel, too.
  • The disk naming is /dev/sd[ab] for the 2x 240GB SSDs intended for the root disks
  • The singular 1.6TB nvme drive for cache storage ends up at /dev/nvme0n1
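
The device naming noted above can be turned into a quick sanity check. The following is a rough sketch, assuming the expected layout is exactly two sd[ab] root disks plus a single nvme0n1 cache device; the function names are made up for illustration and are not part of the installer or puppet code.

```python
import re

ROOT_DISK = re.compile(r"^sd[ab]$")          # the 2x 240GB SSDs
CACHE_DISK = re.compile(r"^nvme\d+n\d+$")    # NVMe namespaces, e.g. nvme0n1


def classify_devices(names):
    """Group block device names by their intended role on these hosts."""
    roles = {"root": [], "cache": [], "other": []}
    for name in sorted(names):
        if ROOT_DISK.match(name):
            roles["root"].append(f"/dev/{name}")
        elif CACHE_DISK.match(name):
            roles["cache"].append(f"/dev/{name}")
        else:
            roles["other"].append(f"/dev/{name}")
    return roles


def layout_matches(roles):
    """True if the host shows two root disks and the single nvme0n1 drive."""
    return len(roles["root"]) == 2 and roles["cache"] == ["/dev/nvme0n1"]


roles = classify_devices(["nvme0n1", "sdb", "sda"])
print(roles, layout_matches(roles))
```

On a live Linux host the input list could be taken from `os.listdir("/sys/block")`.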

TODO (tomorrow I think):

  • New partman recipe and late_command.sh changes for these nodes (sd[ab] as mirrored rootfs and nothing else, nvme0n1 as a giant ext4 cache fs)
  • Puppet cache storage config changes for the nvme drive
  • Get the RPS tooling and dependencies installed on cp1075 manually, so we can figure out how different things are for bnxt's IRQ routing/counts and push any necessary changes there to puppet as well
  • Make some decisions about whether we're going to start our cp* -> stretch transition with (because of) these nodes, or find some other hacky way to stay on jessie a little longer

Change 448076 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] storage config tweaks for cp1075-99

https://gerrit.wikimedia.org/r/448076

Change 448076 merged by BBlack:
[operations/puppet@production] storage config tweaks for cp1075-99

https://gerrit.wikimedia.org/r/448076

Change 448300 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] installer late_command: run nvme inside target

https://gerrit.wikimedia.org/r/448300

Change 448300 merged by BBlack:
[operations/puppet@production] installer late_command: run nvme inside target

https://gerrit.wikimedia.org/r/448300

Change 448306 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] late_command: fix sfdisk path

https://gerrit.wikimedia.org/r/448306

Change 448306 merged by BBlack:
[operations/puppet@production] late_command: fix sfdisk path

https://gerrit.wikimedia.org/r/448306

Change 448506 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp/lvs: add bnxt_en support in NIC tuning stuff

https://gerrit.wikimedia.org/r/448506

Change 448507 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-90: numa_networking: on

https://gerrit.wikimedia.org/r/448507

Change 448508 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-99: define in site.pp

https://gerrit.wikimedia.org/r/448508

Change 448530 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-90: cache storage parameterization

https://gerrit.wikimedia.org/r/448530

Change 448507 merged by BBlack:
[operations/puppet@production] cp1075-90: numa_networking: on

https://gerrit.wikimedia.org/r/448507

Change 448506 merged by BBlack:
[operations/puppet@production] cp/lvs: add bnxt_en support in NIC tuning stuff

https://gerrit.wikimedia.org/r/448506

Change 448530 merged by BBlack:
[operations/puppet@production] cp1075-90: cache storage parameterization

https://gerrit.wikimedia.org/r/448530

Change 448508 merged by BBlack:
[operations/puppet@production] cp1075-90: define in site.pp

https://gerrit.wikimedia.org/r/448508

Change 449184 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cacheproxy: enable scsi_mod.use_blk_mq

https://gerrit.wikimedia.org/r/449184

Change 449184 merged by BBlack:
[operations/puppet@production] cacheproxy: enable scsi_mod.use_blk_mq

https://gerrit.wikimedia.org/r/449184

Change 449187 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075: add to conftool/hieradata node lists

https://gerrit.wikimedia.org/r/449187

Change 449187 merged by BBlack:
[operations/puppet@production] cp1075: add to conftool/hieradata node lists

https://gerrit.wikimedia.org/r/449187

Change 449202 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-99: define storage size

https://gerrit.wikimedia.org/r/449202

Change 449202 merged by BBlack:
[operations/puppet@production] cp1075-99: define storage size

https://gerrit.wikimedia.org/r/449202

Change 449233 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] avoid data=writeback on nvme formatted w/o journal

https://gerrit.wikimedia.org/r/449233

Change 449233 merged by BBlack:
[operations/puppet@production] avoid data=writeback on nvme formatted w/o journal

https://gerrit.wikimedia.org/r/449233

Change 449466 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-99: further mkfs tweaks

https://gerrit.wikimedia.org/r/449466

Change 449466 merged by BBlack:
[operations/puppet@production] cp1075-99: further mkfs tweaks

https://gerrit.wikimedia.org/r/449466

Change 450029 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-99: add to hieradata, conftool-data, acls

https://gerrit.wikimedia.org/r/450029

Change 450029 merged by BBlack:
[operations/puppet@production] cp1075-99: add to hieradata, conftool-data, acls

https://gerrit.wikimedia.org/r/450029

Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp1076.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201808021722_bblack_13684.log.

Completed auto-reimage of hosts:

['cp1076.eqiad.wmnet']

and were ALL successful.

Change 450073 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] base check_disk_options: exclude /srv/nvme mounts

https://gerrit.wikimedia.org/r/450073

Change 450073 merged by BBlack:
[operations/puppet@production] base check_disk_options: exclude /srv/nvme mounts

https://gerrit.wikimedia.org/r/450073

Change 450087 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1075-90: add rest of macaddrs

https://gerrit.wikimedia.org/r/450087

Change 450087 merged by BBlack:
[operations/puppet@production] cp1075-90: add rest of macaddrs

https://gerrit.wikimedia.org/r/450087

Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp1077.eqiad.wmnet', 'cp1078.eqiad.wmnet', 'cp1079.eqiad.wmnet', 'cp1080.eqiad.wmnet', 'cp1081.eqiad.wmnet', 'cp1082.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201808021947_bblack_13239.log.

Change 450148 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] smart-data-dump: support nvme

https://gerrit.wikimedia.org/r/450148

Completed auto-reimage of hosts:

['cp1080.eqiad.wmnet']

Of which those FAILED:

['cp1080.eqiad.wmnet']

Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp1083.eqiad.wmnet', 'cp1084.eqiad.wmnet', 'cp1085.eqiad.wmnet', 'cp1086.eqiad.wmnet', 'cp1087.eqiad.wmnet', 'cp1088.eqiad.wmnet', 'cp1089.eqiad.wmnet', 'cp1090.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201808022050_bblack_29952.log.

Change 450148 merged by BBlack:
[operations/puppet@production] smart-data-dump: support nvme

https://gerrit.wikimedia.org/r/450148

Most of these are installed now, but 2x have initial hardware issues:

  • cp1080 - Reports an uncorrectably-bad DIMM in slot A5 on bootup and halts
  • cp1085 - Seems to have ethernet interface hardware issues, likely a bad SFP, DAC, or cable

Hmmm maybe subtasks are better, setting some of those up: T201174 + T201175

Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp1085.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201808031349_bblack_13801.log.

Completed auto-reimage of hosts:

['cp1085.eqiad.wmnet']

and were ALL successful.

BBlack updated the task description.

These are fully in-service. Will file separate ticket(s) about decomming various older cp10xx machines.