
(Need By: TBD) rack/setup/install pc1011-pc1014
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of pc1011-pc1014

Hostname / Racking / Installation Details

Hostnames: pc1011-pc1014
Racking Proposal: One per row if possible
Networking/Subnet/VLAN/IP: 1g, internal vlan, single production and single mgmt connection. REMINDER: please don't create IPv6 entries for these hosts.
Partitioning/Raid: hw raid10 of all disks, db partition recipe
OS Distro: Buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

pc1011:

  • - receive in system on procurement task T281905 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac 5.00.00.00 , bios 2.11.2 , 1g network 21.80.9, 10g network 21.80.16.95, h730 raid controller 25.5.8.0001)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

pc1012:

  • - receive in system on procurement task T281905 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac 5.00.00.00 , bios 2.11.2 , 1g network 21.80.9, 10g network 21.80.16.95, h730 raid controller 25.5.8.0001)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

pc1013:

  • - receive in system on procurement task T281905 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac 5.00.00.00 , bios 2.11.2 , 1g network 21.80.9, 10g network 21.80.16.95, h730 raid controller 25.5.8.0001)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

pc1014:

  • - receive in system on procurement task T281905 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac 5.00.00.00 , bios 2.11.2 , 1g network 21.80.9, 10g network 21.80.16.95, h730 raid controller 25.5.8.0001)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - no link on network port (checked both on the switch and via the idrac https interface; details: pc1014 D6 u36 port35 Cableid#23000064) - perhaps the cable isn't seated?
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH edited subscribers, added: LSobanski; removed: RobH.

Change 698177 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Set the partitioning scheme to new pc*

https://gerrit.wikimedia.org/r/698177

Change 698177 merged by Marostegui:

[operations/puppet@production] install_server: Set the partitioning scheme to new pc*

https://gerrit.wikimedia.org/r/698177

Change 699424 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Add new parsercache hosts as insetup

https://gerrit.wikimedia.org/r/699424

Change 699424 merged by Marostegui:

[operations/puppet@production] site.pp: Add new parsercache hosts as insetup

https://gerrit.wikimedia.org/r/699424

Hi. Do we have an idea for when these hosts could be available? We have ongoing issues with parsercache (see T282761) that we hope moving to the new HW will partially mitigate.

Hosts received and added to netbox.

This comment was removed by Jclark-ctr.

pc1011 A1 u21 port 9 Cableid#3963
pc1012 B1 u28 port17 Cableid#3947
pc1013 C5 u26 port24 Cableid#3410
pc1014 D6 u36 port35 Cableid#23000064

pc1011 A1 u21 port 9 Cableid#3963 IP 10.65.1.187
pc1012 B1 u28 port17 Cableid#3947 IP 10.65.1.188
pc1013 C5 u26 port24 Cableid#3410 IP 10.65.1.189
pc1014 D6 u36 port35 Cableid#23000064 IP 10.65.1.190
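The racking lines above follow a loose but regular shape (host, rack, unit, switch port, cable ID, optional mgmt IP). As a minimal sketch, assuming that shape holds, they could be parsed into records for cross-checking against Netbox. `RackEntry` and `parse_rack_line` are hypothetical names for illustration, not part of any existing tooling.

```python
import re
from typing import NamedTuple, Optional


class RackEntry(NamedTuple):
    host: str
    rack: str
    unit: int
    port: int
    cable_id: str          # kept as text: factory-labelled IDs are longer numbers
    ip: Optional[str]      # only present on lines that carry "IP <address>"


# Tolerates both "port 9" and "port35", as both spellings occur above.
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s+(?P<rack>[A-Z]\d+)\s+u(?P<unit>\d+)\s+"
    r"port\s*(?P<port>\d+)\s+Cableid#(?P<cable>\d+)"
    r"(?:\s+IP\s+(?P<ip>\S+))?$"
)


def parse_rack_line(line: str) -> RackEntry:
    """Parse one racking line; raise ValueError if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised racking line: {line!r}")
    return RackEntry(
        host=m["host"],
        rack=m["rack"],
        unit=int(m["unit"]),
        port=int(m["port"]),
        cable_id=m["cable"],
        ip=m["ip"],
    )
```

Keeping the cable ID as a string avoids any assumption about its length or format.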

Jclark-ctr added subscribers: RobH, Jclark-ctr.

@RobH only the script in netbox has been run. bios/drac/serial setup has been configured.

The hosts have been powered off.

There are some issues with this task:

  1. The hosts have been provisioned on Netbox with public IPs, see https://netbox.wikimedia.org/ipam/ip-addresses/?q=pc101
  2. The sre.dns.netbox cookbook was not run after the provisioning, which in turn created the following issues:
    • The Icinga alert Uncommitted DNS changes in Netbox fired and went unnoticed
    • Future runs of the cookbook are blocked: whoever runs it next will encounter an unexpected diff and will either halt, and so be blocked, or go ahead and merge the changes, creating wrong DNS records
  3. Homer was not run for the affected switches, which in turn created the following issues:
    • An email alert to rancid-core@ titled [Homer] Device live config differs from committed one
    • Pending changes on the switches that might block other people running homer, who will find a spurious diff they are not aware of.
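The first issue above (public primary IPs on hosts meant for the internal vlan) is the kind of mistake a small pre-check could catch before anything is committed. The sketch below uses only Python's standard `ipaddress` module; `non_private_addresses` is a hypothetical helper, and how the host-to-IP mapping would be obtained from Netbox is left out.

```python
import ipaddress
from typing import Dict


def non_private_addresses(assignments: Dict[str, str]) -> Dict[str, str]:
    """Return the subset of {host: ip} whose address is not in a private range."""
    return {
        host: ip
        for host, ip in assignments.items()
        if not ipaddress.ip_address(ip).is_private
    }


# The mgmt IPs recorded earlier in this task are all private (10.0.0.0/8).
mgmt = {
    "pc1011": "10.65.1.187",
    "pc1012": "10.65.1.188",
    "pc1013": "10.65.1.189",
    "pc1014": "10.65.1.190",
}
offenders = non_private_addresses(mgmt)
```

An empty result means every address is private; any entry returned would need a second look before running the DNS cookbook.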

I'm reverting the primary IP provisioning of those hosts, leaving only the mgmt IPs that were already set up, and I'm running the cookbook to create the mgmt DNS records so as to unblock the situation for everyone else.

Please be extra careful with the provisioning and do not let changes in Netbox sit without running the dns cookbook and homer right after. Also please keep an eye on the Uncommitted DNS changes in Netbox Icinga alert.
As an improvement we could make a cookbook that does all this in one single step.
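A rough sketch of what "one single step" could look like: an ordered runner that executes each provisioning step and stops at the first failure, so Netbox changes never sit without the DNS and homer steps being attempted right after. Everything here is hypothetical. The step names are labels only, and the callables are placeholders rather than real invocations of the cookbook or of homer.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of those that completed.

    Raises RuntimeError at the first failing step, naming what already ran,
    so the operator knows exactly which changes are pending.
    """
    done: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step {name!r} failed after completing: {done}")
        done.append(name)
    return done


# Placeholder actions: a real version would call the actual tooling here.
example_steps: List[Step] = [
    ("netbox provisioning", lambda: True),
    ("dns cookbook", lambda: True),
    ("homer commit", lambda: True),
]
completed = run_steps(example_steps)
```

Failing loudly rather than continuing keeps a half-applied state visible instead of leaving it for the next person to discover as a spurious diff.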

Mentioned in SAL (#wikimedia-operations) [2021-07-12T12:42:20Z] <volans> reverting Primary IP allocation for pc1011-1014, leaving only mgmt IPs - T282484

I've updated the previous message with the homer part too. Given that the pending spurious changes would open the ports for those hosts with the wrong vlan, I'm reverting that part in Netbox too.

This is the config I'm reverting, in case it is needed when the hosts are re-provisioned:

Host    Cable ID  Switch         Port
pc1011  3963      asw2-a1-eqiad  ge-1/0/9
pc1012  3947      asw2-b1-eqiad  ge-1/0/17
pc1013  3410      asw2-c5-eqiad  ge-5/0/24
pc1014  23000064  asw2-d6-eqiad  ge-6/0/36

P.S. the cable ID 23000064 looks like a potential typo.

P.S. the cable ID 23000064 looks like a potential typo.

Not a typo, we just started ordering pre-serial-labeled network cables. Since it's done at the factory, the numbers are much longer than the ones we printed ourselves.

Change 704854 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] pc101[1-4] mac addresses

https://gerrit.wikimedia.org/r/704854

Change 704854 merged by RobH:

[operations/puppet@production] pc101[1-4] mac addresses

https://gerrit.wikimedia.org/r/704854
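The `pc101[1-4]` shorthand in the change title stands for the four hosts in this task. As an illustrative sketch only, a simple bracket range of this form can be expanded into fully qualified names as follows. `expand_hosts` is a hypothetical helper that handles just this one pattern shape; it is not the host-selection syntax implemented by any existing tool, and the default domain is taken from the hostnames used elsewhere in this task.

```python
import re
from typing import List


def expand_hosts(pattern: str, domain: str = "eqiad.wmnet") -> List[str]:
    """Expand 'prefix[lo-hi]' into a list of FQDNs, e.g. 'pc101[1-4]'."""
    m = re.fullmatch(r"(?P<prefix>[a-z]+\d*)\[(?P<lo>\d+)-(?P<hi>\d+)\]", pattern)
    if m is None:
        raise ValueError(f"unsupported host pattern: {pattern!r}")
    lo, hi = int(m["lo"]), int(m["hi"])
    if lo > hi:
        raise ValueError(f"empty range in pattern: {pattern!r}")
    width = len(m["lo"])  # preserve zero padding such as [01-12]
    return [f"{m['prefix']}{n:0{width}d}.{domain}" for n in range(lo, hi + 1)]


hosts = expand_hosts("pc101[1-4]")
```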

RobH removed a project: Patch-For-Review.
RobH added a project: Data-Persistence.

Ack, good to know, thanks. Please let me know in case those new numbers trigger any false positive in any Netbox report or other data validation that we have.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['pc1011.eqiad.wmnet', 'pc1012.eqiad.wmnet', 'pc1013.eqiad.wmnet', 'pc1014.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107151934_robh_10629.log.

Completed auto-reimage of hosts:

['pc1014.eqiad.wmnet']

Of which those FAILED:

['pc1014.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

pc1014.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107152043_robh_32404_pc1014_eqiad_wmnet.log.

RobH updated the task description.

Assigning this to Chris to check the network connection for pc1014. The checklist has been updated for that host, and it should be good to run the reimage script as soon as its network connection is working.

Completed auto-reimage of hosts:

['pc1014.eqiad.wmnet']

Of which those FAILED:

['pc1014.eqiad.wmnet']

wiki_willy added a subscriber: Cmjohnson.

Moving over to @Jclark-ctr to check the network on pc1014. Thanks, Willy

@RobH Sorry for the delay. The cable was not seated completely in the switch. Fixed, it has link now.

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

pc1014.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107221235_kormat_1476_pc1014_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['pc1014.eqiad.wmnet']

and were ALL successful.

@RobH: I took the liberty of reimaging pc1014 as the network connection is now working. I've also set it to 'staged' in netbox, but I'll leave this task for you to resolve in case there's anything more that has to be done. Cheers.

All good, task resolved.