Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudcontrol100[6-7].wikimedia.org

Hostname / Racking / Installation Details

Hostnames: cloudcontrol100[6-7].wikimedia.org
Racking Proposal: Place in separate rows. Can't be placed in E/F.
Networking/Subnet/VLAN/IP: One 10G connection per server. Requires public VLAN.
Partitioning/Raid: all drives in hw RAID10; partman/standard.cfg partman/hwraid-1dev.cfg
OS Distro: Bullseye
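
The bracket notation in the hostnames above is shorthand for a numeric range. As a minimal sketch, it could be expanded as follows. The helper name `expand_hosts` is hypothetical and not part of any Wikimedia tooling, and it handles only a single `[a-b]` range, not the `[67]` style that also appears in this task.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one '[a-b]' numeric range in a hostname pattern.

    Hypothetical helper for illustration only; patterns without a
    range are returned unchanged as a one-element list.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    # Keep the zero-padding width of the lower bound, e.g. [06-12].
    width = len(match.group(1))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(start, end + 1)]


hosts = expand_hosts("cloudcontrol100[6-7].wikimedia.org")
```

Here `hosts` holds the two FQDNs this task covers, cloudcontrol1006.wikimedia.org and cloudcontrol1007.wikimedia.org.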

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcontrol1006:
  • - receive in system on procurement task T303440 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudcontrol1007:
  • - receive in system on procurement task T303440 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH created this task.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added subscribers: nskaggs, Jclark-ctr, Andrew.

@Andrew or @nskaggs,

When this info for racking was filled out by @nskaggs, it included both an FQDN under .eqiad.wmnet and a public vlan? Usually it's one or the other.

Looking at the cloudcontrol hosts currently in eqiad, they are on the public vlan, so they are named like cloudcontrol1005.wikimedia.org, not eqiad.wmnet.

Can you advise and update this task's description accordingly and once updated, assign to @Jclark-ctr for racking when they arrive?

RobH added a parent task: Unknown Object (Task). Apr 25 2022, 11:42 PM

Please note if they do require public IP addresses, please tag in Arzhel as a subscriber so he is aware of the request.

Andrew renamed this task from Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].eqiad.wmnet to Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].wikimedia.org. Apr 26 2022, 3:20 AM
Andrew updated the task description. (Show Details)
Andrew added a subscriber: ayounsi.

You're right, these will need public IPs (but with luck we'll free up the old ones shortly after these go online)

RhinosF1 renamed this task from Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].wikimedia.org to Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org. Apr 26 2022, 5:47 AM

@Andrew, thanks, I'm still on my quest to reduce our public vlan usage ;)
Could those hosts use private IPs (live in the private vlan) and be fronted by LVS instead?

I'm also interested in knowing how they work: will both servers be a redundant pair? If so, how does failover work? Do they need to make outbound connections?

This task is for a refresh of existing hosts: cloudcontrol1003.wikimedia.org and cloudcontrol1004.wikimedia.org. I don't plan to change any of the existing behavior as part of this hardware refresh. Do you need a full rundown of how the existing cluster works?

As services, infra and best practices change over time (and over the 5-year lifetime of a server), it's possible that the way they're set up is no longer optimal, introducing some form of technical debt (here, for example, the use of the public vlan).

That's why I wanted to have a quick exchange while they're being refreshed, to understand how they work and whether their need to be in the public vlan could be challenged. From there I see 3 options:
1/ They have a hard requirement for being in the public vlan, and if it's not already documented it's a good opportunity to do so (I dug through wikitech with no luck)
2/ Their current setup could maybe change, but it would take time and effort; in that case I'd open a follow-up task to discuss it in more detail
3/ It could change and it's actually a low hanging fruit, eg. they could be provisioned directly as private hosts

From your reply I guess it would not be (3).
To answer your question, I do not need a full rundown, but it would be helpful if you could point me to some docs on the traffic flows and overall HA of those hosts.

There are three main flows that currently utilize the public IPs of cloudcontrols (unless I'm missing something):

  1. OpenStack API traffic from cloud and possibly in the future from the wider internet. This is HTTP(S) traffic on various ports and could possibly be replaced with an LVS service (although @aborrero might have Opinions on using LVS for WMCS hardware). Currently it uses the 'openstack.eqiad1.wikimediacloud.org' DNS name, which is usually a CNAME to one of the cloudcontrols.
  2. RabbitMQ traffic from Trove instances, to tcp/5671 and tcp/5672 on each node. This can't trivially be replaced with LVS but needs to be looked at soon anyway if we're getting dedicated Rabbit nodes.
  3. Admin CLI tools making requests to dynamicproxy and enc apis on the cloud realm. Most of this traffic is coming from Horizon (labweb*) or Designate (cloudservices*) hosts, but the admin CLI tools currently live on cloudcontrols. This can use the production HTTP proxies after T305453 is done.
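
For flow 2, one quick way to see whether a client can still reach the RabbitMQ ports after any network change is a plain TCP connect test. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and it only checks that the port accepts connections, not that RabbitMQ itself is healthy or that TLS on 5671 works.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_rabbit_ports(hosts, ports=(5671, 5672)):
    """Map each (host, port) pair to its reachability (hypothetical helper)."""
    return {(host, port): port_open(host, port) for host in hosts for port in ports}


# Example call, to be run from a client such as a Trove instance:
# check_rabbit_ports(["cloudcontrol1006.wikimedia.org", "cloudcontrol1007.wikimedia.org"])
```

Running it from the same place Trove instances live gives a before/after comparison if the hosts ever move behind a proxy or onto a private vlan.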

I agree with @Majavah's assessment, although I wouldn't promise that there aren't other edge cases where we are relying on the IPs being public. It's most likely case 2 although getting a proxy in front of everything feels like a big task.

I should also note that there are always three cloudcontrols -- this task is about refreshing two of the existing three hosts. So if we rearrange the setup as part of this refresh we will also have to retrofit cloudcontrol1005.

@ayounsi please also be aware that our team of five SREs is currently down to three. This means we will have, if anything, negative capacity to take on unplanned work for the next couple of quarters.

I just noticed that this is still assigned to me! I don't think there are any action items left for me that are specific to the new hardware, so I'm bouncing this back to the eqiad folks.

cloudcontrol1006 B2 U11 Cableid 20220204 Port 28
cloudcontrol1007 D2 U1 Cableid 20220203 Port 21

@ayounsi @Andrew Has a decision been made on public vs private VLAN? Also, @Andrew which partman recipe do these require? Are they hardware raid10?

public vlan, just like the existing cloudcontrols please.

All disks in hardware raid10, and then partman recipe 'partman/standard.cfg partman/hwraid-1dev.cfg'.

Change 812020 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding cloudcontrol1006-7 and fix wmcs site.pp entries

https://gerrit.wikimedia.org/r/812020

Change 812020 merged by Cmjohnson:

[operations/puppet@production] Adding cloudcontrol1006-7 and fix wmcs site.pp entries

https://gerrit.wikimedia.org/r/812020

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.org with OS bullseye executed with errors:

  • cloudcontrol1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.org with OS bullseye executed with errors:

  • cloudcontrol1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1007.wikimedia.org with OS bullseye completed:

  • cloudcontrol1007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071847_cmjohnson_1650657_cloudcontrol1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description. (Show Details)
Cmjohnson added a subscriber: dcaro.

these are finished @dcaro

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.org with OS bullseye completed:

  • cloudcontrol1006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071915_cmjohnson_1658620_cloudcontrol1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Change 814890 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudcontrol100[67] into live cloudcontrol nodes

https://gerrit.wikimedia.org/r/814890

Change 814890 merged by Andrew Bogott:

[operations/puppet@production] Make cloudcontrol100[67] into live cloudcontrol nodes

https://gerrit.wikimedia.org/r/814890

Change 814895 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] acme_chief: allow access for cloudcontrol100[67]

https://gerrit.wikimedia.org/r/814895

Change 814895 merged by Andrew Bogott:

[operations/puppet@production] acme_chief: allow access for cloudcontrol100[67]

https://gerrit.wikimedia.org/r/814895

These hosts are now in service and seem to be working.