
Q4:(Need By: TBD) rack/setup/install cloudweb100[34]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of cloudweb100[3-4].

Hostname / Racking / Installation Details

Hostnames: cloudweb100[3-4].eqiad.wmnet
Racking Proposal: Place in WMCS racks. Place in separate rows. Can't be placed in E/F.
Networking/Subnet/VLAN/IP: one 10G connection per server. Requires public VLAN.
Partitioning/RAID: 2-device standard RAID 1
OS Distro: Bullseye

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudweb1003:
  • - receive in system on procurement task T303424 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudweb1004:
  • - receive in system on procurement task T303424 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
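For reference, the DNS, network, and reimage steps above are run from an active cumin host. A rough sketch of the corresponding commands follows; this is illustrative only — the exact cookbook arguments and commit messages used for this task are not recorded here, and flags may differ between tooling versions.

```shell
# Run from an active cumin host. Illustrative sketch only.

# Propagate the mgmt + production DNS entries added in Netbox:
sudo cookbook sre.dns.netbox "Add DNS records for cloudweb100[3-4]"

# Push the Netbox-defined switch port configuration to the network devices:
homer "cloudsw*" commit "Configure ports for cloudweb100[3-4]"

# Reimage each host with the chosen distro (Bullseye per the task
# description; the hosts were later reimaged with Buster):
sudo cookbook sre.hosts.reimage --os bullseye cloudweb1003
sudo cookbook sre.hosts.reimage --os bullseye cloudweb1004
```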

Event Timeline

RobH renamed this task from (Need By: TBD) rack/setup/install cloudweb100[34] to Q4:(Need By: TBD) rack/setup/install cloudweb100[34]. Apr 4 2022, 10:41 PM
RobH created this task.
RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH moved this task from Backlog to Racking / Decom on the cloud-services-team (Hardware) board.
RobH added a parent task: Unknown Object (Task).

Comment moved from T303424#7830897:
From https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy#cloud[lab]web*

These hosts are behind the misc varnish cluster and could be considered for moving into private address space at a later date.
TODO: It is unclear whether current defined best practice requires these hosts to be in the public address space. They are now because of connectivity requirements. Cloud[lab]web* requires the ability to query nova-api which is restricted from private production VLANs. Because of this requirement they are in public address space. The cloudcontrol* refresh and Neutron deployment moved the nova-api service to cloudcontrol* hosts instead of cloudnet*. This should be reevaluated.

@Andrew Maybe I can help with this? We're trying to reduce our public vlan footprint in favor of load balancer front-ends; those hosts might be good candidates as they are already behind LVS :)
Did the situation change since the doc was written?
What kind of flows are the nova-api queries? (source, dest, protocols, etc.)
For example the cloudcontrol hosts are in the public vlan, so hosts in the private vlan can reach them no problem.
We also have the webproxies for the private hosts to use to query endpoints outside of prod.


See comment in T303424#7830897, they *might* be able to go in any 10G rack, private vlan.

Regardless, those are prod hosts (public/private) so they should not go in WMCS racks.

Cloudwebs need access to various OS APIs. Most of them are hosted in the production realm and should be accessible from any production VLAN without cross-realm traffic issues.

The issue comes from the Puppet ENC and webproxy APIs which are hosted on Cloud VPS VMs (in the cloudinfra and project-proxy projects, respectively). Right now that's done by using public vlans. Using the squids might be possible after the legacy IP-based access control those use is changed to proper keystone authentication (T295234, T274666).

Noted! To keep track of the IRC conversation, echoing it here:

is that a hard blocker? or could it be fixed before those hosts are live? It would be nice to not block public IPs until the next refresh :)

Edit: note that there is also work being done towards adding ACLs on the proxies themselves, which could help here (see T300977). But so far it's only at the discussion phase.

FYI, @ayounsi, our mid-term goal is to eliminate the need for this hardware entirely.

  • Wikitech needs to move to the mediawiki cluster, somehow (T237773)
  • Once openstack APIs are fully opened up (T267194 + some security thoughts) Horizon can run on a VM or a general-purpose k8s cluster
  • Striker can probably run on a VM whenever we feel like making the move.

Because of that long-term goal I don't think we should spend much time refactoring the current hardware-based deployment pattern.

Thanks, from what I understand moving those hosts to private IPs is a much shorter-term goal than the ones you mentioned? (I even see patches ready for review!) If so that would be greatly appreciated. Of course I understand it's difficult to weigh the added workload against the benefits in the big WMF picture :)

Please don't take anything I am working on in my free time as any sort of official WMCS team priority :-) In this case T274666 and its subtasks had been on my todo list for a while and https://gerrit.wikimedia.org/r/c/operations/puppet/+/781950 was easily combined with other clean-up that needs to happen there. But yes, as long as people review my patches this should be fully doable in the short term.

Thanks, from what I understand moving those hosts to private IPs is a much shorter-term goal than the ones you mentioned?

Mostly, I'd like work that's spent on these three deployments (striker/horizon/wikitech) to be directed towards the final goal rather than towards temporary side-trips. The work that Taavi is doing certainly would be moving us towards the long-term end state, but basically any time I spend thinking about wikitech /on/ labweb hosts feels wasted.

There is some oauth and 2fa interaction between striker and wikitech that may make separating striker more complicated than I'm imagining; cc'ing @bd808 for his thoughts on that since he's already working on other Striker projects.

cloudweb1003 c8 u39 20220099 port 10 (cloudsw2-c8-eqiad)
cloudweb1004 d5 u16 20220109 port 10 (cloudsw2-d5-eqiad)

@ayounsi @taavi @Andrew Has a determination on public vs private VLAN been made?

@Cmjohnson you should continue to use public VLAN for this. Decisions around changing the architecture of the service shouldn't delay refreshing these machines.

Agreed!

However,

cloudweb1003 c8 u39 20220099 port 10 (cloudsw2-c8-eqiad)
cloudweb1004 d5 u16 20220109 port 10 (cloudsw2-d5-eqiad)

Doesn't match my previous comment:

Regardless, those are prod hosts (public/private) so they should not go in WMCS racks.

Cmjohnson subscribed.

@Jclark-ctr Can you move these servers out of the WMCS racks and into a 10G rack? There is space in B2, D2, D7.

@ayounsi The hosts will be moved tomorrow morning. When I started the racking task I went by the racking proposal ("Place in WMCS racks. Place in separate rows. Can't be placed in E/F."), which was not updated.

cloudweb1003 B2 u3 20220205 port 34
cloudweb1004 D2 u33 20220109 port 33

Change 813697 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding cloudweb1003/4 to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/813697

Change 813697 merged by Cmjohnson:

[operations/puppet@production] Adding cloudweb1003/4 to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/813697

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye executed with errors:

  • cloudweb1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye executed with errors:

  • cloudweb1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye executed with errors:

  • cloudweb1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

From diffscan, those two hosts have their SSH port exposed to the world:

New Open Service List
---------------------
STATUS HOST PORT PROTO OPREV CPREV DNS
OPEN 208.80.154.150 22 tcp 0 6 cloudweb1003.wikimedia.org
OPEN 208.80.155.117 22 tcp 0 6 cloudweb1004.wikimedia.org

Change 814185 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] updating site.pp for cloudweb servers, setup incorrectly for private vlan

https://gerrit.wikimedia.org/r/814185

Change 814185 merged by Cmjohnson:

[operations/puppet@production] updating site.pp for cloudweb servers, setup incorrectly for private vlan

https://gerrit.wikimedia.org/r/814185

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye completed:

  • cloudweb1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207151719_cmjohnson_3607796_cloudweb1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye completed:

  • cloudweb1004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207151731_cmjohnson_3610958_cloudweb1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

Change 815378 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Put cloudweb100[34] into service

https://gerrit.wikimedia.org/r/815378

Change 815378 merged by Andrew Bogott:

[operations/puppet@production] Put cloudweb100[34] into service

https://gerrit.wikimedia.org/r/815378

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudweb1003.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudweb1004.wikimedia.org with OS buster

I'm reimaging these with Buster because mediawiki isn't really supported on Bullseye yet.

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudweb1003.wikimedia.org with OS buster completed:

  • cloudweb1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207211516_andrew_2639298_cloudweb1003.out
    • Checked BIOS boot parameters are back to normal
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudweb1004.wikimedia.org with OS buster completed:

  • cloudweb1004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207211516_andrew_2639302_cloudweb1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 816026 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:mariadb::grants: add cloudweb1003/1004 grants

https://gerrit.wikimedia.org/r/816026

From diffscan:

STATUS HOST PORT PROTO OPREV CPREV DNS
OPEN 208.80.154.150 7443 tcp 0 6 cloudweb1003.wikimedia.org
OPEN 208.80.155.117 7443 tcp 0 6 cloudweb1004.wikimedia.org

Not sure if expected, but there is at least a discrepancy between v4 and v6 that should be fixed.

nc -zv cloudweb1003.wikimedia.org 7443
nc: connect to cloudweb1003.wikimedia.org (2620:0:861:2:208:80:154:150) port 7443 (tcp) failed: Connection refused
Connection to cloudweb1003.wikimedia.org (208.80.154.150) 7443 port [tcp/*] succeeded!
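The v4/v6 discrepancy above comes from a service listening only on the IPv4 socket. A small sketch of a per-address-family reachability check, mirroring the nc -zv test; `check_port` is a hypothetical helper name, and it uses bash's /dev/tcp redirection rather than nc:

```shell
# check_port HOST PORT: report whether a TCP connect succeeds.
# Bash-specific: relies on the /dev/tcp/HOST/PORT special filename.
check_port() {
    local host=$1 port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "OPEN ${host}:${port}"
    else
        echo "CLOSED ${host}:${port}"
    fi
}

# Checking the v4 and v6 addresses separately catches listeners bound
# only to 0.0.0.0 and not to ::, as seen with envoy here (addresses
# taken from the scan output above):
# check_port 208.80.154.150 7443
# check_port 2620:0:861:2:208:80:154:150 7443
```

Running the check against both records of a dual-stack hostname is what exposes this class of firewall/bind mismatch; a scanner like diffscan that only probes v4 would report the port open while v6 clients get connection refused.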

Change 816171 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: close down cloudweb envoy port

https://gerrit.wikimedia.org/r/816171

Change 816026 merged by Ladsgroup:

[operations/puppet@production] P:mariadb::grants: add cloudweb1003/1004 grants

https://gerrit.wikimedia.org/r/816026

Change 816171 merged by Andrew Bogott:

[operations/puppet@production] hieradata: close down cloudweb envoy port

https://gerrit.wikimedia.org/r/816171

@Andrew what do you need done with these? The task was re-opened and I see some activity, but I'm not sure what the current status is at the moment.

I don't need anything; these are in service and working fine. Arzhel re-opened over a firewall concern, which I'm looking at.

That firewall issue should be sorted with my latest patch above.