
Q4:(Need By: TBD) rack/setup/install cloudweb100[34]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of cloudweb100[3-4].

Hostname / Racking / Installation Details

Hostnames: cloudweb100[3-4].eqiad.wmnet
Racking Proposal: Place in WMCS racks. Place in separate rows. Can't be placed in E/F.
Networking/Subnet/VLAN/IP: one 10G connection per server. Requires public VLAN.
Partitioning/RAID: 2-device standard RAID 1
OS Distro: Bullseye

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudweb1003:
  • - receive in system on procurement task T303424 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudweb1004:
  • - receive in system on procurement task T303424 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
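For reference, the DNS, network, and reimage steps above are run from an active cumin host. A rough sketch of the corresponding commands follows; this is illustrative only — the exact cookbook arguments and commit messages used for this task are not recorded here, and flags may differ between tooling versions.

```shell
# Run from an active cumin host. Illustrative sketch only.

# Propagate the mgmt + production DNS entries added in Netbox:
sudo cookbook sre.dns.netbox "Add DNS records for cloudweb100[3-4]"

# Push the Netbox-defined switch port configuration to the network devices:
homer "cloudsw*" commit "Configure ports for cloudweb100[3-4]"

# Reimage each host with the chosen distro (Bullseye per the task
# description; the hosts were later reimaged with Buster):
sudo cookbook sre.hosts.reimage --os bullseye cloudweb1003
sudo cookbook sre.hosts.reimage --os bullseye cloudweb1004
```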

Event Timeline

RobH renamed this task from (Need By: TBD) rack/setup/install cloudweb100[34] to Q4:(Need By: TBD) rack/setup/install cloudweb100[34]. Apr 4 2022, 10:41 PM
RobH created this task.
RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH moved this task from Backlog to Racking / Decom on the cloud-services-team (Hardware) board.
RobH added a parent task: Unknown Object (Task).

Comment moved from T303424#7830897:
From https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy#cloud[lab]web*

These hosts are behind the misc varnish cluster and could be considered for moving into private address space at a later date.
TODO: It is unclear whether current defined best practice requires these hosts to be in the public address space. They are now because of connectivity requirements. Cloud[lab]web* requires the ability to query nova-api which is restricted from private production VLANs. Because of this requirement they are in public address space. The cloudcontrol* refresh and Neutron deployment moved the nova-api service to cloudcontrol* hosts instead of cloudnet*. This should be reevaluated.

@Andrew Maybe I can help with this? We're trying to reduce our public vlan footprint in favor of load balancer front-ends; those hosts might be good candidates as they are already behind LVS :)
Did the situation change since the doc was written?
What kind of flows are the nova-api queries? (source, dest, protocols, etc.)
For example the cloudcontrol hosts are in the public vlan, so hosts in the private vlan can reach them no problem.
We also have the webproxies for the private hosts to use to query endpoints outside of prod.


See comment in T303424#7830897, they *might* be able to go in any 10G rack, private vlan.

Regardless, those are prod hosts (public/private) so they should not go in WMCS racks.

Cloudwebs need access to various OS APIs. Most of them are hosted in the production realm and should be accessible from any production VLAN without cross-realm traffic issues.

The issue comes from the Puppet ENC and webproxy APIs which are hosted on Cloud VPS VMs (in the cloudinfra and project-proxy projects, respectively). Right now that's done by using public vlans. Using the squids might be possible after the legacy IP-based access control those use is changed to proper keystone authentication (T295234, T274666).

Noted! To keep track of the IRC conversation, echoing it here:

is that a hard blocker? or could it be fixed before those hosts are live? It would be nice to not block public IPs until the next refresh :)

Edit: note that there is also work being done towards adding ACLs on the proxies themselves, which could help here (see T300977). But so far it's only at the discussion phase.

FYI, @ayounsi, our mid-term goal is to eliminate the need for this hardware entirely.

  • Wikitech needs to move to the mediawiki cluster, somehow (T237773)
  • Once openstack APIs are fully opened up (T267194 + some security thoughts) Horizon can run on a VM or a general-purpose k8s cluster
  • Striker can probably run on a VM whenever we feel like making the move.

Because of that long-term goal I don't think we should spend much time refactoring the current hardware-based deployment pattern.

Thanks, from what I understand moving those hosts to private IPs is a much shorter-term goal than the ones you mentioned? (I even see patches ready for review!) If so that would be greatly appreciated. Of course I understand it's difficult to weigh the added workload against the benefits in the big WMF picture :)

Please don't take anything I am working on in my free time as any sort of official WMCS team priority :-) In this case T274666 and its subtasks had been on my todo list for a while and https://gerrit.wikimedia.org/r/c/operations/puppet/+/781950 was easily combined with other clean-up that needs to happen there. But yes, as long as people review my patches this should be fully doable in the short term.

Thanks, from what I understand moving those hosts to private IPs is a much shorter-term goal than the ones you mentioned?

Mostly, I'd like work that's spent on these three deployments (striker/horizon/wikitech) to be directed towards the final goal rather than towards temporary side-trips. The work that Taavi is doing certainly would be moving us towards the long-term end state, but basically any time I spend thinking about wikitech /on/ labweb hosts feels wasted.

There is some oauth and 2fa interaction between striker and wikitech that may make separating striker more complicated than I'm imagining; cc'ing @bd808 for his thoughts on that since he's already working on other Striker projects.

cloudweb1003 c8 u39 20220099 port 10 (cloudsw2-c8-eqiad)
cloudweb1004 d5 u16 20220109 port 10 (cloudsw2-d5-eqiad)

@ayounsi @taavi @Andrew Has a determination on public vs private VLAN been made?

@Cmjohnson you should continue to use public VLAN for this. Decisions around changing the architecture of the service shouldn't delay refreshing these machines.

Agreed!

However,

cloudweb1003 c8 u39 20220099 port 10 (cloudsw2-c8-eqiad)
cloudweb1004 d5 u16 20220109 port 10 (cloudsw2-d5-eqiad)

Doesn't match my previous comment:

Regardless, those are prod hosts (public/private) so they should not go in WMCS racks.

Cmjohnson subscribed.

@Jclark-ctr Can you move these servers out of the WMCS racks and into a 10G rack? There is space in B2, D2, D7.

@ayounsi The hosts will be moved tomorrow morning. When I started the racking task I went by the racking proposal ("Place in WMCS racks. Place in separate rows. Can't be placed in E/F."), which was not updated.

cloudweb1003 B2 u3 20220205 port 34
cloudweb1004 D2 u33 20220109 port 33

Change 813697 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding cloudweb1003/4 to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/813697

Change 813697 merged by Cmjohnson:

[operations/puppet@production] Adding cloudweb1003/4 to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/813697

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye executed with errors:

  • cloudweb1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye executed with errors:

  • cloudweb1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye executed with errors:

  • cloudweb1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

From diffscan, those two hosts have their SSH port exposed to the world:

New Open Service List
---------------------
STATUS HOST PORT PROTO OPREV CPREV DNS
OPEN 208.80.154.150 22 tcp 0 6 cloudweb1003.wikimedia.org
OPEN 208.80.155.117 22 tcp 0 6 cloudweb1004.wikimedia.org

Change 814185 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] updating site.pp for cloudweb servers, setup incorrectly for private vlan

https://gerrit.wikimedia.org/r/814185

Change 814185 merged by Cmjohnson:

[operations/puppet@production] updating site.pp for cloudweb servers, setup incorrectly for private vlan

https://gerrit.wikimedia.org/r/814185

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye completed:

  • cloudweb1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207151719_cmjohnson_3607796_cloudweb1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye completed:

  • cloudweb1004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207151731_cmjohnson_3610958_cloudweb1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

Change 815378 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Put cloudweb100[34] into service

https://gerrit.wikimedia.org/r/815378

Change 815378 merged by Andrew Bogott:

[operations/puppet@production] Put cloudweb100[34] into service

https://gerrit.wikimedia.org/r/815378

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudweb1003.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudweb1004.wikimedia.org with OS buster

I'm reimaging these with Buster because mediawiki isn't really supported on Bullseye yet.

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudweb1003.wikimedia.org with OS buster completed:

  • cloudweb1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207211516_andrew_2639298_cloudweb1003.out
    • Checked BIOS boot parameters are back to normal
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudweb1004.wikimedia.org with OS buster completed:

  • cloudweb1004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207211516_andrew_2639302_cloudweb1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 816026 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:mariadb::grants: add cloudweb1003/1004 grants

https://gerrit.wikimedia.org/r/816026

From diffscan:

STATUS HOST PORT PROTO OPREV CPREV DNS
OPEN 208.80.154.150 7443 tcp 0 6 cloudweb1003.wikimedia.org
OPEN 208.80.155.117 7443 tcp 0 6 cloudweb1004.wikimedia.org

Not sure if expected, but there is at least a discrepancy between v4 and v6 that should be fixed.

nc -zv cloudweb1003.wikimedia.org 7443
nc: connect to cloudweb1003.wikimedia.org (2620:0:861:2:208:80:154:150) port 7443 (tcp) failed: Connection refused
Connection to cloudweb1003.wikimedia.org (208.80.154.150) 7443 port [tcp/*] succeeded!
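The v4/v6 discrepancy above comes from a service listening only on the IPv4 socket. A small sketch of a per-address-family reachability check, mirroring the nc -zv test; `check_port` is a hypothetical helper name, and it uses bash's /dev/tcp redirection rather than nc:

```shell
# check_port HOST PORT: report whether a TCP connect succeeds.
# Bash-specific: relies on the /dev/tcp/HOST/PORT special filename.
check_port() {
    local host=$1 port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "OPEN ${host}:${port}"
    else
        echo "CLOSED ${host}:${port}"
    fi
}

# Checking the v4 and v6 addresses separately catches listeners bound
# only to 0.0.0.0 and not to ::, as seen with envoy here (addresses
# taken from the scan output above):
# check_port 208.80.154.150 7443
# check_port 2620:0:861:2:208:80:154:150 7443
```

Running the check against both records of a dual-stack hostname is what exposes this class of firewall/bind mismatch; a scanner like diffscan that only probes v4 would report the port open while v6 clients get connection refused.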

Change 816171 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: close down cloudweb envoy port

https://gerrit.wikimedia.org/r/816171

Change 816026 merged by Ladsgroup:

[operations/puppet@production] P:mariadb::grants: add cloudweb1003/1004 grants

https://gerrit.wikimedia.org/r/816026

Change 816171 merged by Andrew Bogott:

[operations/puppet@production] hieradata: close down cloudweb envoy port

https://gerrit.wikimedia.org/r/816171

@Andrew what do you need done with these? The task was re-opened and I see some activity, but I'm not sure what the current status is at the moment.

I don't need anything; these are in service and working fine. Arzhel re-opened over a firewall concern, which I'm looking at.

That firewall issue should be sorted with my latest patch above.