
VMs requested for stewards
Closed, ResolvedPublic

Description

Site/Location: eqiad or codfw? Both
Number of systems: 1 should be more than enough
Service: Wikimedia Stewards automations that are ineligible under Wikimedia Cloud VPS's terms of use.
Networking Requirements: internal IP
Processor Requirements: 1 or 2 vCPUs should be sufficient.
Memory: 2 GB
Disks: 20 GB
Other Requirements: Selected Wikimedia Stewards (and/or steward-approved non-stewards) need root-level access (a new access group is needed; for now, probably just me / @Urbanecm).

Detailed reasoning

Wikimedia Stewards have several workflows with the following characteristics (an example workflow is described below):

  • are easy to automate,
  • are used frequently enough for automation to have a visible impact,
  • allow direct access to (or operate with) Nonpublic personal information and Personal information as defined by relevant WMF policies (Privacy policy or Confidentiality agreement) without explicit consent of the user(s) the data is about, and as such, are bound by the restrictions set by the Privacy policy and/or ANPDP.

Because of the third point, it is currently impossible to experiment with possible automation within WMF premises: there is no suitable production machine, and as far as I know, processing Privacy policy-protected data is prohibited in Wikimedia Cloud by its ToU (in particular, the ToU make it explicit that there are no guarantees in terms of WMCS security, which seems incompatible with the expectations set by the Privacy policy).

This can be illustrated with automating (on/off)boarding for community functionaries (described in more detail below), which is the first project I'd like to use the machine for. For a system to be able to automatically provision required accesses for functionaries, it necessarily needs to have credentials that allow it to grant/revoke said accesses. This also means that any such system would have direct access to virtually all private data that the WMF exposes to trusted functionaries, beginning with user IP data and ending with security reports. Restricting such access would be impractical or impossible, because the system's purpose is to perform the permission adjustments and it needs to have the rights to do so.

A production VM seems to be a reasonable place for such on/offboarding scripts to live. I'm opening this request to start an initial conversation with SRE and stewards about whether having a production machine would even be an option, or whether there are other solutions more suitable for the problem I'm proposing to solve here.

Please let me know if there is a better place to run a discussion like this. I'm also happy to discuss the needs we (Stewards) have synchronously, if that would be beneficial.

Example Steward Workflow

The most important workflow that could be automated without significant effort (assuming an environment where private data can be accessed safely) is (on/off)boarding community functionaries. Community functionaries tend to have access to several resources that need to be enabled/disabled individually, in addition to the on-wiki permission group. Many of those resources include access to privileged data, which (as explained above) cannot be maintained from Cloud. Examples include (a rough sketch of how these targets could be modelled follows the list):

  • Private wikis, such as checkuser.wikimedia.org (contains user IP data), steward.wikimedia.org (contains miscellaneous WMF confidential data) or vrt-wiki.wikimedia.org (contains excerpts from VRTS and other WMF confidential data)
  • Private Mailman lists (such as stewards-l, checkuser-l, global-sysops, global-renamers, ...); some of them are frequently used for deliberations involving WMF confidential data
  • Private IRC channels (#wikimedia-checkuser, #wikimedia-privacy, ...); some of them are frequently used for deliberations involving WMF confidential data
  • Phabricator ACLs, such as acl*security_steward or acl*stewards, which provide access to sensitive Phabricator tasks.
  • Secondary on-wiki user groups (for example, the steward permission is composed of the steward Meta-Wiki group and the steward global CentralAuth-provided group; both need to be granted to make a user an actual steward).
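
To give a sense of how these targets could be represented in an onboarding tool, here is a purely hypothetical sketch; none of these names or structures exist yet:

# Hypothetical model of the resources to provision for the steward role.
STEWARD_RESOURCES = {
    "private_wikis": [
        "checkuser.wikimedia.org",
        "steward.wikimedia.org",
        "vrt-wiki.wikimedia.org",
    ],
    "mailing_lists": ["stewards-l", "checkuser-l"],
    "irc_channels": ["#wikimedia-checkuser", "#wikimedia-privacy"],
    "phabricator_acls": ["acl*security_steward", "acl*stewards"],
    "user_groups": [
        {"wiki": "metawiki", "group": "steward"},  # local Meta-Wiki group
        {"wiki": "global", "group": "steward"},    # global CentralAuth group
    ],
}


def onboarding_checklist(role: str) -> dict:
    """Return the resources to provision for a functionary role (sketch only)."""
    return {"steward": STEWARD_RESOURCES}.get(role, {})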

Event Timeline


From a high level view that seems perfectly fine. We initiate non-wiki offboardings from the production networks in a similar manner. I'll add this to the agenda of the next Infrastructure Foundations SRE meeting (next Monday) to have a wider discussion in the team.

It would however imply that any steward who wants to use that VM would need to sign a volunteer NDA (since that is needed for access to Wikimedia production servers). But we can certainly start with yourself and then expand to additional interested stewards once the process has been established.

My other question would be around the scripts which will run from this VM: it's my understanding that these are still to be written and that the current offboarding of functionaries is a fully manual process. Is that correct?

From a high level view that seems perfectly fine. We initiate non-wiki offboardings from the production networks in a similar manner. I'll add this to the agenda of the next Infrastructure Foundations SRE meeting (next Monday) to have a wider discussion in the team.

Thanks!

It would however imply that any steward who wants to use that VM would need to sign a volunteer NDA (since that is needed for access to Wikimedia production servers). But we can certainly start with yourself and then expand to additional interested stewards once the process has been established.

This is understood.

My other question would be around the scripts which will run from this VM: it's my understanding that these are still to be written and that the current offboarding of functionaries is a fully manual process. Is that correct?

Correct. I logged a VM request first, to learn whether doing this in production would seem like a good idea to the SREs.

Quick status update; this has seen agreement in the IF SRE meeting. The next step is to sort out which SRE would take care of the day-to-day work (which would seemingly be very little, but at least in the beginning would mean some effort to deploy/review the scripts etc.). I'll update the task when there has been progress.

@Urbanecm what ongoing support would you envision beyond setting up the VM, some sort of deployment method, and keeping up to date with security patches?

@Urbanecm what ongoing support would you envision beyond setting up the VM, some sort of deployment method, and keeping up to date with security patches?

Thanks for the question and for discussing this on Monday! I think day-to-day operation will need very little support from SREs. I envision the stewards will have a repository assigned, where they can push/review the deployed tool, similar to bots we already maintain in Wikimedia Cloud. What would be needed from SREs in the long term amounts to this:

  • Helping with puppet-touching stuff (both reviewing the patches and helping with writing them), including adding necessary secrets to private puppet as needed.
  • Assisting when the integration with the source of information (mostly MediaWiki) fails for reasons unrelated to the code itself.
  • Processing cluster access requests as needed (adding stewards designated to have access and removing them whenever they leave the group).

I expect ordinary business to take the form of application code changes rather than SRE-level changes (plus running the offboarding scripts, of course).

Initially, I think the stewards would need the following (apart from the points you mentioned in your message):

  • Establishing how to integrate with other services; see the examples in the subpoints (a rough sketch of the Mailman case follows this list)
    • The on/offboarding automation would need to be able to add/remove users from mailing lists handled in Mailman. How should we tell Mailman to do that?
    • We would need email data for stewards. Ideally, this would come from MediaWiki itself (as that already has the data), but it is not exposed anywhere. How should we access that data?
  • Clarifying how granting access to a new user would work (each shell group seems to have an approval point of contact; would it make sense for that person to be a steward delegate in this case? I feel there might be points to clarify, since this VM would essentially be owned by a recognized group of volunteers -- stewards)
  • Possibly: sharing experience with the offboarding the SREs already do, if any of it applies to the stewards.
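
To make the first subpoint more concrete, here is a minimal sketch of what the Mailman integration might look like, assuming the lists run on Mailman 3 and the onboarding tool can somehow reach its core REST API; the API host, credentials and list id are placeholders, and how to reach Mailman at all is exactly the open question:

import requests

# Placeholders: the real API endpoint and credentials would have to come
# from whatever integration path we agree on with the Mailman admins.
MAILMAN_API = "http://lists.example.wmnet:8001/3.1"
AUTH = ("restadmin", "CHANGE_ME")


def subscribe(list_id: str, email: str) -> None:
    """Add a member, pre-confirmed so no confirmation mail is sent."""
    resp = requests.post(f"{MAILMAN_API}/members", auth=AUTH, data={
        "list_id": list_id,
        "subscriber": email,
        "pre_verified": "true",
        "pre_confirmed": "true",
        "pre_approved": "true",
    })
    resp.raise_for_status()


def unsubscribe(list_id: str, email: str) -> None:
    """Look the membership up, then delete it via its self_link."""
    member = requests.get(f"{MAILMAN_API}/lists/{list_id}/member/{email}", auth=AUTH)
    member.raise_for_status()
    requests.delete(member.json()["self_link"], auth=AUTH).raise_for_status()

# Example with a hypothetical list id:
# subscribe("stewards-l.lists.wikimedia.org", "steward@example.org")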

Does this clarify?

Since this is a single VM which can run in either DC, please create it in codfw. We currently have way more space there.

LSobanski raised the priority of this task from Low to Medium. Sep 20 2023, 1:57 PM
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

Hi @LSobanski, @taavi mentioned to me privately that if we want the stewards machine to run ircservserv, as discussed during the meeting we had earlier this week, the machine might actually need to be in the public VLAN (unlike what I originally specified in the VM request) to be able to run an IRC bot. Can that be clarified, and if this is the correct understanding, can the specification in the description be updated?

Hi @LSobanski, @taavi mentioned to me privately that if we want the stewards machine to run ircservserv, as discussed during the meeting we had earlier this week, the machine might actually need to be in the public VLAN (unlike what I originally specified in the VM request) to be able to run an IRC bot. Can that be clarified, and if this is the correct understanding, can the specification in the description be updated?

We should define what external access is required. If possible, it would be better to keep this in the private VLAN and use the proxy service, as external IPs are scarce (cc @ayounsi).

Indeed, and hosts on public IPs have a much larger attack surface, so they should be a last-resort option. The IRC bot might need to be audited too if it connects to servers outside of the WMF.

Hi @LSobanski, @taavi mentioned to me privately that if we want the stewards machine to run ircservserv, as discussed during the meeting we had earlier this week, the machine might actually need to be in the public VLAN (unlike what I originally specified in the VM request) to be able to run an IRC bot. Can that be clarified, and if this is the correct understanding, can the specification in the description be updated?

We should define what external access is required. If possible, it would be better to keep this in the private VLAN and use the proxy service, as external IPs are scarce (cc @ayounsi).

If we want to start with the IRC part of the onboarding, we would need to connect to irc.libera.chat via the IRC protocol. I'm not sure what's needed for that to work; I know that alert1001 is doing that for logmsgbot purposes, which is in the external VLAN.

If needed, we can also start with a different part of the onboarding; I don't have a strong preference. I created a list of ~20 places stewards need to be added to (or removed from) at T346935: Create an on/offboarding system for Wikimedia Stewards. Starting with MediaWiki itself might be a reasonable choice too, as it has quite a few places to update (although not as many as IRC channels), which would let us keep this fully in-cluster.

Let me know what you think!
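
For context, this is roughly what the IRC part would involve at the protocol level; a bare-bones sketch where the nick and channel are placeholders, and in practice we'd more likely reuse an existing bot framework (such as ircservserv) than raw sockets:

import socket
import ssl


def connect_irc(nick: str = "stewards-bot", channel: str = "#wikimedia-stewards-test"):
    """Connect to Libera Chat over TLS, answer PINGs, join a channel (sketch)."""
    ctx = ssl.create_default_context()
    sock = ctx.wrap_socket(socket.create_connection(("irc.libera.chat", 6697)),
                           server_hostname="irc.libera.chat")
    sock.sendall(f"NICK {nick}\r\nUSER {nick} 0 * :{nick}\r\n".encode())
    buf = b""
    while True:
        buf += sock.recv(4096)
        *lines, buf = buf.split(b"\r\n")
        for line in lines:
            if line.startswith(b"PING"):
                sock.sendall(line.replace(b"PING", b"PONG", 1) + b"\r\n")
            elif b" 001 " in line:  # welcome numeric: registration complete
                sock.sendall(f"JOIN {channel}\r\n".encode())
                return sock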

If needed, we can also start with a different part of the onboarding; I don't have a strong preference. I created a list of ~20 places stewards need to be added to (or removed from) at T346935: Create an on/offboarding system for Wikimedia Stewards. Starting with MediaWiki itself might be a reasonable choice too, as it has quite a few places to update (although not as many as IRC channels), which would let us keep this fully in-cluster.

I'd say let's start with an internal IP. If there are insurmountable issues in adding IRC support later (bugs in the proxy libraries or whatever), switching to a public IP is still an option: we'd have little to no data to migrate, and the VM would simply be reimaged.

If needed, we can also start with a different part of the onboarding; I don't have a strong preference. I created a list of ~20 places stewards need to be added to (or removed from) at T346935: Create an on/offboarding system for Wikimedia Stewards. Starting with MediaWiki itself might be a reasonable choice too, as it has quite a few places to update (although not as many as IRC channels), which would let us keep this fully in-cluster.

I'd say let's start with an internal IP. If there are insurmountable issues in adding IRC support later (bugs in the proxy libraries or whatever), switching to a public IP is still an option: we'd have little to no data to migrate, and the VM would simply be reimaged.

Ack, sounds good to me.

I am happy to take this on and create the VMs and my team is ok with being called the owner in puppet.

We should really just create it in both DCs. We have said before on multiple occasions that we don't want to introduce any more one-offs, and if we do, it will just mean we have to coordinate again or handle exceptions every 6 months.

dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 20 --cluster codfw -t T344164 --group B --os bookworm stewards2001
Ready to create Ganeti VM stewards2001.codfw.wmnet in the codfw cluster on group B with 1 vCPUs, 2.0GB of RAM, 20GB of disk in the private network.

1 CPU, 2GB RAM, 20GB disk as requested. codfw, private network, on Debian bookworm

creation in progress, just to start somewhere

Exception raised while parsing arguments for cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 334, in _safe_call
    ret_value = func(*args, **kwargs)
  File "/usr/lib/python3.9/argparse.py", line 1830, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1863, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1907, in _parse_known_args
    option_tuple = self._parse_optional(arg_string)
  File "/usr/lib/python3.9/argparse.py", line 2194, in _parse_optional
    if not arg_string[0] in self.prefix_chars:
TypeError: 'int' object is not subscriptable

Probably because the host name was not in the partman config yet, but this is also an uncaught error.


added to https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Servers

Change 972070 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/netboot: add special VMs for stewards

https://gerrit.wikimedia.org/r/972070

Change 972318 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.ganeti.makevm: fix parameter passed to reimage

https://gerrit.wikimedia.org/r/972318

Exception raised while parsing arguments for cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 334, in _safe_call
    ret_value = func(*args, **kwargs)
  File "/usr/lib/python3.9/argparse.py", line 1830, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1863, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1907, in _parse_known_args
    option_tuple = self._parse_optional(arg_string)
  File "/usr/lib/python3.9/argparse.py", line 2194, in _parse_optional
    if not arg_string[0] in self.prefix_chars:
TypeError: 'int' object is not subscriptable

Oops, this was a bug introduced recently with the puppet7 work. I've sent a fix above.

@Dzahn once the above patch is merged you can proceed by directly running the reimage cookbook on the host, as the VM was correctly created and the last remaining step was calling the reimage cookbook.

Change 972318 merged by jenkins-bot:

[operations/cookbooks@master] sre.ganeti.makevm: fix parameter passed to reimage

https://gerrit.wikimedia.org/r/972318

Thank you @Volans ! Got it, will do that :)

Change 972070 merged by Dzahn:

[operations/puppet@production] site/netboot: add special VMs for stewards

https://gerrit.wikimedia.org/r/972070

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards2001.codfw.wmnet with OS bookworm completed:

  • stewards2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311071902_dzahn_526197_stewards2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Thanks @Dzahn for making the VM! Following our IRC conversations, I'm putting a list of packages/requirements that I'd like to have on the stewards VM:

  • Utils: wget, curl, vim, less
  • Python: python3 itself, plus the following modules:
    • python3-requests
    • python3-requests-oauthlib
    • python3-yaml
    • python3-click
  • Other requirements
    • The ability to do MediaWiki API requests (a rough sketch of what such a request could look like follows below)
    • Ideally, the ability to access HTTP proxies for outside network access (although we can get back to this later once a specific use case is present, if needed)
    • A managed clone of a Git repository (https://gitlab.wikimedia.org/repos/stewards/onboarding-system); personally, I'd be happy to use git pull as a deployment mechanism, either manually or automatically, but I'm not wedded to a particular approach
    • Not sure if worth mentioning, but I'd need a configuration file with some secrets (OAuth tokens, for example). Not sure how to provide that.

It's likely that there'd be more as the project progresses, but initially, those are the things that come to my mind.
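
To illustrate the MediaWiki API item above, here is a rough sketch of how the onboarding tool might grant an on-wiki group using python3-requests and python3-requests-oauthlib with an owner-only OAuth 1.0a consumer. All credentials and the group name are placeholders, and the actual calls the tool ends up making may well differ:

import requests
from requests_oauthlib import OAuth1

API = "https://meta.wikimedia.org/w/api.php"
# Placeholder owner-only OAuth 1.0a consumer credentials.
AUTH = OAuth1("consumer_key", "consumer_secret", "access_token", "access_secret")


def add_group(user: str, group: str, reason: str) -> dict:
    """Add `user` to an on-wiki `group` via the userrights API (sketch only)."""
    session = requests.Session()
    session.auth = AUTH
    # Fetch a userrights token first...
    token = session.get(API, params={
        "action": "query", "meta": "tokens", "type": "userrights", "format": "json",
    }).json()["query"]["tokens"]["userrightstoken"]
    # ...then perform the group change.
    resp = session.post(API, data={
        "action": "userrights", "user": user, "add": group,
        "reason": reason, "token": token, "format": "json",
    })
    resp.raise_for_status()
    return resp.json()

# add_group("Example", "steward", "Onboarding per stewards' confirmation")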

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm completed:

  • stewards1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311072034_dzahn_548193_stewards1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 972485 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] stewards: create initial role and profile

https://gerrit.wikimedia.org/r/972485

Change 972485 merged by Dzahn:

[operations/puppet@production] stewards: create initial role and profile

https://gerrit.wikimedia.org/r/972485

Change 972490 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] stewards: fix package name for python3-requests-oauthlib

https://gerrit.wikimedia.org/r/972490

Change 972490 merged by Dzahn:

[operations/puppet@production] stewards: fix package name for python3-requests-oauthlib

https://gerrit.wikimedia.org/r/972490

Change 972874 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] admin: create group for stewards VMs

https://gerrit.wikimedia.org/r/972874

Thanks @Dzahn for making the VM! Following our IRC conversations, I'm putting a list of packages/requirements that I'd like to have on the stewards VM:

  • Utils: wget, curl, vim, less

Of course, no problem:)

Regardless of what I claimed on IRC, these are all already installed by default by the base profile I applied in the meantime:

[stewards1001:~] $ wget -V
GNU Wget 1.21.3 built on linux-gnu.

[stewards1001:~] $ curl -V
curl 7.88.1

[stewards1001:~] $ vim --help
VIM - Vi IMproved 9.0 (2022 Jun 28, compiled May 04 2023 10:24:44)

[stewards1001:~] $ less -V
less 590 (GNU regular expressions)

  • Python: python3 itself, plus the following modules:
    • python3-requests
    • python3-requests-oauthlib
    • python3-yaml
    • python3-click

python3-requests and python3-yaml were also installed by default.

I added python3-requests-oauthlib and python3-click in the new profile for the stewards VMs, so puppet installed them.

ii  python3-requests              2.28.1+dfsg-1                        all          elegant and simple HTTP library for Python3, built for human beings
ii  python3-requests-oauthlib     1.3.0+ds-1                           all          module providing OAuthlib auth support for requests (Python 3)
ii  python3-yaml                  6.0-3+b2                             amd64        YAML parser and emitter for Python3
ii  python3-click                 8.1.3-2                              all          Wrapper around optparse for command line utilities - Python 3.x

  • The ability to do MediaWiki API requests
  • Ideally, the ability to access HTTP proxies for outside network access (although we can get back to this later once a specific use case is present, if needed)

[stewards1001:~] $ curl https://mediawiki.org
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

[stewards1001:~] $ env | grep -i proxy
no_proxy=wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1
NO_PROXY=wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1
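
As for outside network access: requests picks proxy settings up from the environment by default, so once a forward proxy is configured for this host the scripts shouldn't need anything special. A hypothetical example, where the proxy hostname is only a placeholder and not an existing service:

import requests

# requests honours http(s)_proxy / no_proxy from the environment by default;
# the explicit proxy URL below is just a placeholder for whatever forward
# proxy the stewards VMs end up being allowed to use.
proxies = {"https": "http://webproxy.example.wmnet:8080"}
resp = requests.get("https://example.org/", proxies=proxies, timeout=10)
print(resp.status_code)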

Regarding the Git repository clone: I will add this to the puppet code. Where would you like it to be checked out in the file system? Under /srv sounds ok?

  • Not sure if worth mentioning, but I'd need a configuration file with some secrets (OAuth tokens, for example). Not sure how to provide that.

Please add the secrets to your home dir on some existing prod server (or on this machine once you have access), make them readable to just you, and let me know where to find them. I will take them as root, add them to the private puppet repo, and then have puppet write them to files. Also let me know where, ideally, you want them to be written.
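
On the application side, the scripts could then read that puppet-provisioned file with python3-yaml; a minimal sketch, assuming a placeholder path under /etc that is still to be agreed on:

import yaml

# Placeholder path; the real location under /etc is still to be agreed on.
CONFIG_PATH = "/etc/stewards-onboarding/config.yaml"


def load_config(path: str = CONFIG_PATH) -> dict:
    """Load the puppet-managed configuration (OAuth tokens etc.)."""
    with open(path) as fh:
        return yaml.safe_load(fh)

# config = load_config()
# oauth = config["oauth"]   # secrets stay out of the git repository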

Change 972874 merged by Dzahn:

[operations/puppet@production] admin: create group for stewards VMs

https://gerrit.wikimedia.org/r/972874

I will add this to the puppet code. Where would you like it to be checked out in the file system? Under /srv sounds ok?

Yes, absolutely.

  • Not sure if worth mentioning, but I'd need a configuration file with some secrets (OAuth tokens, for example). Not sure how to provide that.

Please add the secrets to your home dir on some existing prod server (or on this machine once you have access), make them readable to just you, and let me know where to find them. I will take them as root, add them to the private puppet repo, and then have puppet write them to files. Also let me know where, ideally, you want them to be written.

Will do. Is there some sort of standard/preferred location?

Will do. Is there some sort of standard/preferred location?

Not really, deploy1002 will do!

Will do. Is there some sort of standard/preferred location?

Not really, deploy1002 will do!

Sorry for the confusion, I meant the "where you want them to be written to" part: do we keep service configuration in a standard place?

Change 972882 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] stewards: add git::clone of stewards/onboarding-system from gitlab

https://gerrit.wikimedia.org/r/972882

Sorry for the confusion, I meant the "where you want them to be written to" part: do we keep service configuration in a standard place?

Ah, well then I'd just say under /etc/something.

Change 972882 merged by Dzahn:

[operations/puppet@production] stewards: add git::clone of stewards/onboarding-system from gitlab

https://gerrit.wikimedia.org/r/972882

This has now been added to puppet.

Notice: /Stage[main]/Profile::Stewards/Git::Clone[repos/stewards/onboarding-system]/Exec[git_clone_repos/stewards/onboarding-system]/returns: executed successfully

Let's fix this though:

[stewards2001:/srv/repos/onboarding-system] $ git status
fatal: detected dubious ownership in repository at '/srv/repos/onboarding-system'

Change 972896 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] stewards: use git:clone parameters to shared repo among several users

https://gerrit.wikimedia.org/r/972896

Change 972896 merged by Dzahn:

[operations/puppet@production] stewards: use git:clone parameters to share repo among several users

https://gerrit.wikimedia.org/r/972896

With the last puppet change above, the git repo is now shared between users, there are no more warnings about the permissions, and the wikidev group owns it.

I rm -rf'ed the repo and only ran puppet on both machines, and things look just fine. "git status" can be run without root, with no warnings.

We agreed on IRC that at this point we can call this ticket resolved.

Shell access and possible follow-ups will be handled separately.

Dzahn renamed this task from "1 VMs requested for stewards" to "VMs requested for stewards". Nov 8 2023, 10:27 PM
Dzahn closed this task as Resolved.
Dzahn updated the task description. (Show Details)

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm executed with errors:

  • stewards1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311132001_dzahn_73821_stewards1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm executed with errors:

  • stewards1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311132324_dzahn_176746_stewards1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with OS bookworm completed:

  • stewards1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311140014_dzahn_200028_stewards1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB