Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ms-fe1009-1012

Hostname / Racking / Installation Details

Hostnames: ms-fe1009-1012
Racking Proposal: Preferably same rows as the hosts getting replaced, refreshing ms-fe100[5-8]
Networking/Subnet/VLAN/IP: 10G production network
Partitioning/Raid: Same as existing ms-fe hosts
OS Distro: Stretch

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-fe1009:

  • - receive in system on procurement task T291972 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ms-fe1010:

  • - receive in system on procurement task T291972 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ms-fe1011:

  • - receive in system on procurement task T291972 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ms-fe1012:

  • - receive in system on procurement task T291972 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH created this task.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
wiki_willy renamed this task from (Need By: TBD) rack/setup/install ms-fe1009-1012 to Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012. Oct 22 2021, 9:51 PM

Servers added to netbox

@LSobanski Quick question: we are limited in space if we keep these in the same rows as the hosts they are replacing. ms-fe10[09..10] would share a rack, and ms-fe10[11..12] would also share a rack.

ms-fe100[5-8] are only split between two rows. Is there any reason we would not use other rows?

I see codfw is configured with 4 different racks so I don't see why we wouldn't do the same thing here.

cc @fgiunchedi in case there's something we don't know.

yes +1 to spread around rows as much as we can
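The "spread around rows" goal above can be checked mechanically. The following is only a minimal sketch: the rack labels for ms-fe1009-1011 are hypothetical placeholders (only ms-fe1012 in E1 is stated in this task), and it assumes a rack label begins with its row letter. A real check would read locations from Netbox instead.

```python
from collections import defaultdict


def hosts_per_row(placement):
    """Group hosts by row, assuming the rack label starts with the row letter."""
    rows = defaultdict(list)
    for host, rack in placement.items():
        rows[rack[0].upper()].append(host)
    return dict(rows)


def crowded_rows(placement):
    """Return only the rows that hold more than one host."""
    return {
        row: sorted(hosts)
        for row, hosts in hosts_per_row(placement).items()
        if len(hosts) > 1
    }


# Hypothetical placement: A1 and C1 are made-up labels for illustration.
placement = {
    "ms-fe1009": "A1",
    "ms-fe1010": "A1",
    "ms-fe1011": "C1",
    "ms-fe1012": "E1",
}

print(crowded_rows(placement))
```

An empty result would mean every frontend sits in its own row; any entry lists the hosts that share a row and so share that row's failure domain.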

@fgiunchedi I racked ms-fe1012 in the new cage e1. I believe it's going to be used to test the network in the cage for a little bit. Afterward do you want it to stay there or move it back to the old cage?

Change 758967 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new ms-fe1090[9]|1[0-2]) to site.pp

https://gerrit.wikimedia.org/r/758967

Change 758967 merged by Cmjohnson:

[operations/puppet@production] Adding new ms-fe1090[9]|1[0-2]) to site.pp

https://gerrit.wikimedia.org/r/758967

When the network/cage are ready to be handed off I believe we can leave the host there, what do you think @MatthewVernon ?

I have no objection, as long as Netbox knows where it is :)

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1009.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1010.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1011.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1009.eqiad.wmnet with OS stretch completed:

  • ms-fe1009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202101705_cmjohnson_15093_ms-fe1009.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1010.eqiad.wmnet with OS stretch completed:

  • ms-fe1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202101706_cmjohnson_16888_ms-fe1010.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1011.eqiad.wmnet with OS stretch completed:

  • ms-fe1011 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202101712_cmjohnson_17602_ms-fe1011.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

@MatthewVernon @fgiunchedi @wiki_willy ms-fe1009-1011 are yours. ms-fe1012 is in the new cage and is still being used for testing at the moment; once it is released I will be able to do the initial OS install.

@Cmjohnson thanks! Do you have an idea how long ms-fe1012 is going to be needed for testing, please?

++ @cmooney, who might be able to provide an answer on that. I think he's wrapping things up with testing though, so maybe about another week? Thanks, Willy

@MatthewVernon I have no specific need for this particular machine. I've probably a day or two of testing to complete, but I'm awaiting some other connectivity in the new cage to do so, which unfortunately makes a firm date hard to provide.

If you need it urgently I would suggest we re-plan it for the existing rows and deploy there for now. Where it is (new cage, with totally new network architecture), it won't be available for use until we've completed all testing and have sign off to go live. That won't be long but I can't give a precise date right now, with any luck we'll get it over the line next week.

@cmooney thanks for that - if it's next week or so, I'm happy to wait.

[background: these 4 hosts are h/w refresh for swift frontends. So I can swap 3 now, and then come back when ms-fe1012 is ready, or wait and do all 4 at once. If it was going to be a long time, I should probably do the former, but as it's not it's less hassle for me to do all four together once ms-fe1012 is ready]

@MatthewVernon thanks for the feedback.

I'll know by mid-week if we are on target. Should be fine to have everything ready by next week for you. If there is any slippage or issue I'll let you know but should be fine.

Appreciate your patience :)

@cmooney don't forget that 1012 is in the new cage, it could take awhile to get that going.

Hi folks - I think @cmooney 's testing is blocked on a new cage being ready - is there a phab ticket for that (that this ticket could be linked to), so I don't have to keep coming back and asking for updates here, please? Maybe @Cmjohnson knows :)

@MatthewVernon hey!

My apologies, I was supposed to feed back before now. We should be good to go with ms-fe1012 now. There are a few other servers racked up also, as can be seen here:

https://w.wiki/4uWF

Apart from ms-fe1012, the others need the Netbox script run by DC-Ops to add the right switch port and assign IP addresses. That's a quick thing to do, so I'm sure it can be done soon.

I probably should do a degree of "hand holding" with you for the server image / go live. We've a couple of live servers there, but I believe that the ms-fe* nodes might sit behind our LVS load balancers, is that correct? That shouldn't cause any problem but it'll make it the first real server behind LVS in the new racks. So I want to keep a close eye on it to make sure the process works for you as normal, and that all network elements work as expected.

But yes overall should be good to go, I'll touch base on irc about when you want to try the reimage on that first one.

Thanks. Yes, the ms-fe* nodes will end up behind LVS; but they're not in service at that point. So from my POV, whenever you (or DC team) are ready to image ms-fe1012, that's good for me, and I can have a look at it once it's installed, without any concern of it disrupting production.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1012.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1012.eqiad.wmnet with OS stretch completed:

  • ms-fe1012 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203081757_cmjohnson_15680_ms-fe1012.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

Change 769382 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add role::insetup for ms-be1012

https://gerrit.wikimedia.org/r/769382

Change 769382 merged by Elukey:

[operations/puppet@production] Add role::insetup for ms-fe1012

https://gerrit.wikimedia.org/r/769382

elukey subscribed.

Hi Chris! There are a couple of issues with this task:

  1. The new hosts were added to site.pp with https://gerrit.wikimedia.org/r/758967, but there is a broader catch-all regex just above (node /^ms-fe1\d\d\d\.eqiad\.wmnet$/ ) that overrides the one added in your patch. I followed up with https://gerrit.wikimedia.org/r/c/operations/puppet/+/769382 to improve the current config.
  2. ms-fe1012 seems down from the icinga perspective, even if I can ssh. Pings and tcp/dns/etc.. are not working on the host, maybe there is still some extra config to add (I noticed that the node is on row E). Lemme know if you need more info :)
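The overlap described in point 1 can be illustrated outside Puppet. This is only a sketch: the catch-all pattern is the one quoted above, while `NEW_HOSTS` is a hypothetical narrower pattern written for illustration, not the expression used in either patch. Python's `re` stands in for Puppet's node matching here; it shows that both patterns match the new hosts, not which definition Puppet ends up applying.

```python
import re

# Catch-all pattern quoted in the comment above.
CATCH_ALL = re.compile(r"^ms-fe1\d\d\d\.eqiad\.wmnet$")

# Hypothetical narrower pattern for the new hosts only (an assumption,
# not the expression from the actual patch).
NEW_HOSTS = re.compile(r"^ms-fe10(09|1[0-2])\.eqiad\.wmnet$")


def matching_patterns(fqdn):
    """Return the names of all patterns that match the given FQDN."""
    patterns = {"catch_all": CATCH_ALL, "new_hosts": NEW_HOSTS}
    return [name for name, rx in patterns.items() if rx.match(fqdn)]


for number in range(1005, 1013):
    fqdn = f"ms-fe{number}.eqiad.wmnet"
    print(fqdn, matching_patterns(fqdn))
```

Every new host matches both patterns, so a node block written only for ms-fe1009-1012 does not isolate them unless the broader pattern is narrowed as well.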

@elukey thanks for looking at this.

I am alarmed and not sure what the cause of the network issues was here. What seemed to be broken was that the connected top-of-rack switch (lsw1-e1-eqiad) could not resolve ARP for the machine's IPv4 address. IPv6 did seem to be working.

I was scratching my head tbh. I cannot explain how the SSH worked if the switch had no ARP entry for the host. Nevertheless I shut/unshut the IRB interface on the switch (as this did not seem to be behaving normally), after which the switch properly processed the ARP responses from the host, and added an entry to its local ARP table. From there everything began working as expected, I can ping, make DNS requests etc. Very odd.

Prior to these going live I did extensive testing on the network in the new rows, and indeed used this very server for much of it, and connectivity was fine and uninterrupted for weeks. So I am (well it's Murphy's law) both worried and disappointed we've hit this buggy behaviour now.

For the time being I think we probably just need to keep a close eye on it. ARP is a relatively simple function, and it isn't related to any of the new, more complex config we've added here (for VXLAN/EVPN for instance). So it's surprising to see such an issue, and as I say we didn't hit anything like that in the previous tests.

EDIT: One thing worth mentioning is the IP of ms-fe1012, 10.64.130.2, was at times during testing configured on the top-of-rack switch itself on irb.1031. This was removed, but I'm wondering did we hit some odd bug where the switch still had it somewhere in a state table, which prevented it being learnt in ARP. And shutting the int cleared it out. Just speculation but including the detail as it may be relevant.

@cmooney thanks for the update. To be clear, do you think I'm OK to put this one back into swift::proxy now? I might then procrastinate actually bringing it into service until next week just to check no further gremlins appear :-)

Change 769443 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] site: ms-fe1012 no longer insetup

https://gerrit.wikimedia.org/r/769443

Change 769443 merged by MVernon:

[operations/puppet@production] site: ms-fe1012 no longer insetup

https://gerrit.wikimedia.org/r/769443

@MatthewVernon apologies for the late reply, I've been only working part-time the last few days as I'd been ill.

I think it is fine to proceed, but the decision to let it sit for a few days before going live was probably the most sensible. Seems stable in Icinga so I'm comfortable to dismiss it as a one-off bug we hit (probably caused by re-using that IP).

I don't think there is any need to delay further but given it's Friday maybe best to tackle next week. I've also changed the Netbox status of all those components to 'active', which should have been done earlier, to clear up any ambiguity on that side.

@MatthewVernon Just to follow up: having checked all network interfaces, forwarding tables and the end devices, everything looks to be working fine with ms-fe1012, including the traffic from lvs1019.

So anyway, as far as the network side of things I think we are ok. Thanks for your patience on this one :)

Great, thanks. I think we can close this now :)