
Q1: (Need By: TBD) rack/setup/install cloudswift100[12]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of cloudswift100[12]

Hostname / Racking / Installation Details

Hostnames: cloudswift1001, cloudswift1002
Racking Proposal: Use the WMCS dedicated 10G racks. These can be placed in any WMCS rack and co-exist with any WMCS service; however, please place each host in a separate row from the other.
Networking/Subnet/VLAN/IP: two 10G connections. The exact networking is not yet certain: likely cloud-hosts1-eqiad for the primary connection and another vlan (possibly non-WMCS) for the secondary. This will need to be determined before the systems arrive on-site; see the networking section below.
Partitioning/Raid: standard, raid1-2dev (a quick verification sketch follows this list)
OS Distro: Bullseye
Technical Contact: @aborrero
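
As the name suggests, "raid1-2dev" presumably means the standard two-drive software RAID1 recipe. A minimal post-install check might look like the following (illustrative sketch only; device and partition names will vary):

# The two drives should appear as members of a single md RAID1 array:
cat /proc/mdstat
# md0 : active raid1 sdb2[1] sda2[0]    <- example output, names/partitions will differ
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT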

Networking Details

Discussion on purchase task T286586 denotes this is a new service, and the networking requirements are not entirely known at time of system order placement. The networking will need to be determined before the hosts arrive (approximately 20 days), so both cloud-services-team (Hardware) and netops have been added as project tags, and the relevant users subscribed at time of task creation. The discussion on the purchase task assumes this will need to use both of its 10G ports, with one likely in the cloud hosts vlan (primary port) and one likely in another vlan (unknown at this time.) This service will be consuming ceph/rbd and presenting it to the public internet.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below. (An illustrative sketch of the commands behind the DNS, network, and reimage steps follows the checklists.)

cloudswift1001:

  • - receive in system on procurement task T286586 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudswift1002:

  • - receive in system on procurement task T286586 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
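
For reference, the Netbox DNS, Homer, and reimage steps above are normally driven from a cumin host with commands roughly like the following. This is an illustrative sketch only: exact flags, the device filter, and the task number (TXXXXXX) are placeholders, not taken from this task.

# Propagate the DNS records added in Netbox:
sudo cookbook sre.dns.netbox "Add mgmt/production DNS for cloudswift100[12] - TXXXXXX"

# Commit the switch port configuration defined in Netbox (device filter illustrative):
homer "cloudsw*eqiad*" commit "Configure ports for cloudswift100[12] - TXXXXXX"

# OS install / first puppet run (the reimage cookbook form used later in this task):
sudo cookbook sre.hosts.reimage --os bullseye -t TXXXXXX cloudswift1001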

Event Timeline


@cmooney These hosts have come in and have been racked. Unless something has changed, these racks are correct; please assign to @Cmjohnson.

cloudswift1001: Rack C8, U35, ports 5/18 on cloudsw2-c8-eqiad, cable IDs 11059 / 11061
cloudswift1002: Rack D5, U33, ports 4/5 on cloudsw1-d5-eqiad, cable IDs 11060 / 11062

@aborrero is it possible to have more information on this new service? Design doc or similar. I can't find anything on Wikitech.

I want to make sure we don't get into an XY problem, as well as document why we configured its network that way for future reference.

Ideally a high-level overview of what it does, who/what will interact with it, bandwidth needs, how it will scale, etc.

@aborrero is it possible to have more information on this new service? Design doc or similar. I can't find anything on Wikitech.

I want to make sure we don't get into an XY problem, as well as document why we configured its network that way for future reference.

Ideally a high-level overview of what it does, who/what will interact with it, bandwidth needs, how it will scale, etc.

Find more information here: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/cloudswift

I just created that and I'm still iterating over it, but you can see most relevant information already.

Thanks for the doc, some follow up questions to make sure I understand it properly.

However, like the openstack APIs that live on cloudcontrol servers, we would like to expose the swift API to Cloud VPS VMs (and eventually the internet at large)

As this is the point that drives the need for a public IP (and exposes a new service to the Internet, with all the risks that come with it), it would be useful to detail why the internet at large needs access to this endpoint.

As there are 2 servers, how will HA be managed between the two? Active/passive? Will they share a VIP?

Before deploying a new vlan, could existing and well tested tools be leveraged?

For example, having the servers in the 172.16/12 space and their public IPs NATed to 185.15.56.0/25? Possibly through Neutron.

Or could LVS front the public VIP (in our regular VIP pool), forwarding traffic to the cloudswift interface on the cloud-hosts vlan?

I take it the main concern here is allocating a public IPv4 address, which is a scarce resource, no?
It seems we have a reserved block 185.15.56.128/26 (https://netbox.wikimedia.org/ipam/prefixes/3/). Couldn't we just subnet that one and perhaps allocate 185.15.56.128/28 for this new vlan?
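
For what it's worth, that split checks out arithmetically; a quick sketch with the Python standard library (nothing WMF-specific assumed):

python3 -c 'import ipaddress; print(*ipaddress.ip_network("185.15.56.128/26").subnets(new_prefix=28), sep="\n")'
# 185.15.56.128/28
# 185.15.56.144/28
# 185.15.56.160/28
# 185.15.56.176/28    (four /28s of 16 addresses each)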

Anyway, replies inline. Feel free to copy/paste from here to https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/cloudswift if you need the information at hand for future reference.

Thanks for the doc, some follow up questions to make sure I understand it properly.

However, like the openstack APIs that live on cloudcontrol servers, we would like to expose the swift API to Cloud VPS VMs (and eventually the internet at large)

As this is the point that drives the need for a public IP (and exposes a new service to the Internet, with all the risks that come with it), it would be useful to detail why the internet at large needs access to this endpoint.

One of the main reasons people use our cloud is to host internet-facing web services and other tools (like APIs and such).
Our plan is to use swift to store static object blobs. Some of the things that are usually stored that way are website static assets, like images, CSS files, fonts, javascript libraries, etc.

We have plenty of tools and utilities that could benefit from the swift API today, to name a few:

Having the API open to the internet is one of the main use cases of swift. Objects stored in swift can be directly requested using the API, something like this:

https://swift.openstack.eqiad1.wikimediacloud.org/v1/{account}/{container}/{object}

for example:

https://swift.openstack.eqiad1.wikimediacloud.org/v1/12345678912345/images/flowers/rose.jpg
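
For illustration, fetching such an object is a plain HTTP GET. A hedged sketch using curl, reusing the illustrative account/container/object names from the URL above (the token is only needed for non-public containers and is a placeholder here):

# Public object:
curl -sL -o rose.jpg \
  "https://swift.openstack.eqiad1.wikimediacloud.org/v1/12345678912345/images/flowers/rose.jpg"

# Authenticated request (e.g. listing a private container) passes a Keystone token:
curl -s -H "X-Auth-Token: $OS_TOKEN" \
  "https://swift.openstack.eqiad1.wikimediacloud.org/v1/12345678912345/images/flowers/"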

More information on how the swift API works can be seen:

Enabling this functionality in our cloud is almost a mandatory step to evolve and move our services in the direction of modern technology / cloud offerings.

We are, in general, aware of the risks associated with running internet-facing services. I say this in the context of our plan, which is to work in iterations:

  • work first to introduce swift to cloud-internal clients only (i.e., firewall off access from the internet in the initial deployment)
  • gain understanding of the service, how to operate it, etc.
  • open it to the public internet

As there are 2 servers, how will HA be managed between the two? Active/passive? Will they share a VIP?

Our initial approach will be a simple active/passive setup.

We could use keepalived/VRRP and consume only 1 public IPv4 address (plus another on the gateway, which would be the cloudgw router).

Before deploying a new vlan, could existing and well tested tools be leveraged?

For example, having the servers in the 172.16/12 space and their public IPs NATed to 185.15.56.0/25? Possibly through Neutron.

This is something that can definitely be done. However, that short sentence hides many traps that we want to avoid for now.

Or could LVS front the public VIP (in our regular VIP pool), forwarding traffic to the cloudswift interface on the cloud-hosts vlan?

Not an option we want to consider at this point.

We may need un-NATed traffic from cloud VMs to the API endpoints. Using LVS as you describe would be in violation of the cross-realm traffic guidelines.

Moreover, we have been told this is undesirable several times:

  • purpose separation: general LVS would ideally not be used to host cloud-dedicated resources.
  • reputation: running cloud-dedicated resources on IPv4 pools that are dedicated to host the wikis is undesirable.
  • DNS separation: associating cloud-dedicated services with the .wikimedia.org domain is undesirable. We should use ....wikimediacloud.org instead.

We don't have established/formal policies for the last 3 points. Shall we work on having them? :-P

Thanks!

I take it the main concern here is allocating a public IPv4 address, which is a scarce resource, no?

That's one of them, but not the most important one to me.

We currently have well maintained and well tested tools and processes to expose services internally and externally. They offer some combination of security, monitoring, and HA, and we have staff experienced with them.

Introducing a "new way" of exposing a service means having to re-implement some of those mechanisms, as well as maintaining them for a long time, which means increasing our attack surface as well as the SRE workload.

My questions are to make sure we have studied all the existing ways, and deemed (and documented) them unsuitable, before we introduce a new one.
And then, if a new one is needed, to gather (and document) all the short/medium/long term goals and design to make sure it's as future-proof as possible (hence my questions about HA, for example).

I hope that clarifies my thought-process.

Our initial approach will be a simple active/passive setup.

What's the ideal end state?

However, that short sentence hides many traps that we want to avoid for now.

Could we detail/document them?
"For now" means this might change?

We may need un-NATed traffic from cloud VMs to the API endpoints.

Why is un-NATed traffic needed?
I'd worry that allowing un-NATed traffic puts us back in a spot with too-tight integration between VMs and hosts outside the cloud-instance vlan. Is that a risk?

Not an option we want to consider at this point.

I agree overall, but want us to be explicit on why we shouldn't go that way, as you did. Thanks!

Which means increasing our attack surface as well as SRE workload.

Sorry, I'm having problems connecting the dots here.

  • On the server management side (control plane), nothing will change (i.e., icinga, install servers, DNS, puppet, netbox, etc.). This is just another server, and I don't see how this increases attack surface or SRE workload.
  • On the public service side (data plane), the request here is to allocate a new vlan/subnet from an IPv4 range/pool that has traditionally been used by WMCS (185.15.x.x). The new vlan/subnet will be "cloud realm", which has traditionally been managed by WMCS.

Other than the act of documenting the setup (already done), creating the netbox allocations and setting up the network bits (i.e., setting up cloudsw devices), I don't see how this generates new attack surfaces or workload for the SRE teams.
Well, creating one vlan on cloudsw and maintaining it is definitely a workload, but it shouldn't be a big deal, no?

To be clear: exposing this service (data plane) will be managed and maintained by WMCS. We won't be using any facility under SRE responsibility other than the edge routing (cloudsw, core routers, etc.), which needs only small changes.

Could you please elaborate on the new attack surfaces and the SRE workloads that concern you?

I'll try and sum up what my thought process on this was.

Firstly the security consideration is that we will have cloudswift servers connected to the cloud-hosts1-eqiad vlan on one interface, and directly to the public internet (via cloudgw) on another. That situation means that we are reliant on the admins of the cloudswift and cloudgw servers (both wmcs) to properly take care of security and network isolation.

To my mind this does not change the security assessment versus what's already in place. Cloudgw1001, for instance, has a leg in cloud-hosts1-eqiad, and another on a publicly routable network, cloud-instance-transport1-b-eqiad. So the same dependency to properly segment and isolate the networks already exists, and lies with the same team (wmcs).

In the existing case isolation is done on that host using the Linux VRF / l3mdev mechanism. On the new cloudswift devices I presume IP forwarding will be disabled which should largely take care of it. Both need proper firewalling and hardening to prevent malicious connections from the internet of course.
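
For concreteness, a minimal sketch of how that assumption is usually verified and enforced on a Debian host (the sysctl keys are standard; the drop-in file name is illustrative, and in practice this would be managed via puppet):

# Both values should be 0 so the kernel never routes between the two NICs:
sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding

# Making it persistent:
printf 'net.ipv4.ip_forward = 0\nnet.ipv6.conf.all.forwarding = 0\n' \
  | sudo tee /etc/sysctl.d/99-no-forwarding.conf
sudo sysctl --system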

@ayounsi if I've missed something here please advise. But my overall thinking was that making these changes would not introduce any new security consideration. SRE already trust WMCS to manage hosts connected to both the cloud-hosts vlan and the public internet.

Apart from the security side of things, the guidance in SRE has been to treat WMCS as a "separate entity", something like a hosted customer. Perhaps a hosted customer that is our friend and we trust, but you get the idea. So whether the routing plan for cloudswift is the most optimal, or if it could be done with 1 NIC rather than 2 or whatever, is not really something I felt the need to comment on. That's a matter for WMCS. As long as we were not reducing the overall network security I figured it was up to them.

@cmooney, I agree with your take on the security aspect.

We're not in a typical service provider (ISP)/customer relationship, where the customer does whatever they want.
We need to work together (SRE/WMCS) to figure out the best approach in terms of networking for any new service on the WMF network. Even more so if it's publicly reachable, and even more so if it's outside our standard practices.

My questions are to understand and more broadly document how this new service will work, so we can identify security, scalability, maintainability, and overall design pitfalls ahead of time. For example, not having to redesign the service's networking in a few months/years, as well as being able to recall in the future why it was designed this way.

We've made good progress since the task creation, from no documentation to a draft doc. My questions in the previous comments are to address what I think are still blind spots.

From a certain point of view what we're doing here is validating case 4 in the cross-realm traffic guidelines. Part of the goal of the document was to clarify the architecture on a wide scope, to reduce this kind of friction per-project/per-server/per-idea etc.

We have an upcoming network sync meeting (WMCS/SRE-IF) on 2021-11-24. I propose we make this topic the main agenda point of that meeting.

aborrero changed the task status from Open to Stalled. (Dec 14 2021, 5:57 PM)

FYI network details for these servers are blocked on T296411: cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet, which is in turn stalled, so marking this one the same.

What is the status on this one? It has been sitting for a while.

@Jclark-ctr These are blocked on a variety of tech decisions; no action needed in the DC for now. Thanks for checking in!

As an update, this is now blocked on T297596: have cloud hardware servers in the cloud realm using a dedicated LB layer. The previous implementation discussion led to a finalization of guidelines, which are now published at https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines#Case_4:_cloud-dedicated_hardware. This implementation will be conformant with those guidelines, which requires T297596 (cloud-based load balancers) to exist. The network requirements are likely to shift to a new cloud-private + cloud-hosts VLAN setup, but otherwise I don't anticipate any changes required related to network or racking.

@Jclark-ctr See @nskaggs's comment above; it looks like we are good to move on with the racking/network and setup of this task. I will be taking over this task, but before that, can you please move cloudswift1001 from cloudsw2-c8 to cloudsw1-c8?
Thanks

cloudswift1001: Rack C8, U35, ports 5/18 on cloudsw2-c8-eqiad, cable IDs 11059 / 11061
cloudswift1002: Rack D5, U33, ports 4/5 on cloudsw1-d5-eqiad, cable IDs 11060 / 11062

Verified cables for both servers; below are the ports and cable IDs. @Papaul

cloudswift1001: Rack C8, U35, ports 5/18 on cloudsw1-c8-eqiad, cable IDs 11059 / 11061
cloudswift1002: Rack D5, U33, ports 31/36 on cloudsw1-d5-eqiad, cable IDs 11060 / 11062

@Jclark-ctr there is no network cable connected to either node.

xe-0/0/5        up    down cloudswift1001 {#11059}
xe-0/0/31       up    down cloudswift1002 {#11060}

@Papaul Cables were connected to the correct ports. I did swap cables while verifying.

Replaced cable, new cable ID 230304500295:
xe-0/0/5        up    down  cloudswift1001 {#11059}
Replaced cable, new cable ID 230304500165:
xe-0/0/31       up    down  cloudswift1002 {#11060}

@Papaul, note that these hosts are still pending some trial work in codfw1dev so you shouldn't spend any effort on these hosts until we unblock you.

@Jclark-ctr I checked those servers again from the switch side; see below.
They are using NON-JNPR cables, which may be the reason both servers' links are showing down. Can you please check and possibly use something compatible with Juniper?
Thanks

Xcvr 31               NON-JNPR     230304500165      SFP+-10G-CU2M
Xcvr 5                NON-JNPR     230304500295      SFP+-10G-CU2M

Replaced both cables. They were newer Wave2Wave DAC cables.

Xcvr 31      REV 01   740-030077   H70824500300      SFP+-10G-CU3M
Xcvr 5       REV 01   740-030077   G1807123036-1      SFP+-10G-CU3M
xe-0/0/5        up    up   cloudswift1001 {#11059}
xe-0/0/31       up    up   cloudswift1002 {#11060}

@Andrew Yes, we can still do the OS install part and resolve this task. When we are ready to do the network changes we can open another task.
Thanks

Change 919245 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add cloudswift100[1-2] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/919245

Change 919245 merged by Papaul:

[operations/puppet@production] Add cloudswift100[1-2] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/919245

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS buster executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Papaul added a subscriber: Jhancock.wm.

@Jhancock.wm was trying to install the OS on cloudswift1001 and the server was not getting DHCP. After taking a look at the switch, I saw that the switch was not learning any MAC address on the first 10G NIC, but it was learning a MAC address on the second 10G NIC.
@Jhancock.wm you can work with @Jclark-ctr next week to move the cable to the first NIC, and once that is done you can restart the re-image cookbook. I will assign you this task.

Thanks

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ was alerting with:

cloudswift1001 (WMF5069) Device is in PuppetDB but is Planned in Netbox (should be Active or Failed)

I believe it's an oversight and based on the above comment I set it to "failed".

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Change 925850 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] fix typo for cloudswift100[1-2] in site.pp

https://gerrit.wikimedia.org/r/925850

Change 925850 merged by Papaul:

[operations/puppet@production] fix typo for cloudswift100[1-2] in site.pp

https://gerrit.wikimedia.org/r/925850

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye completed:

  • cloudswift1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306011607_jhancock_3172209_cloudswift1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

@Jclark-ctr when you have a moment / are back, can you swap the ports on the NIC for cloudswift1002? Thanks!

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudswift1002.eqiad.wmnet with OS bullseye

Swapped cables on cloudswift1002

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudswift1002.eqiad.wmnet with OS bullseye completed:

  • cloudswift1002 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306081352_jclark_4023631_cloudswift1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye completed:

  • cloudswift1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306090204_pt1979_1766728_cloudswift1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Papaul updated the task description.

This is complete