deploy1003 implementation tracking
Closed, Resolved, Public

Description

This task is to track the service implementation of deploy1003.

Once the linked racking task has been resolved, this task can be implemented.

This sub-task creation/update is per the request of serviceops; this task is assigned at creation to the 'Sub-team Technical Contact' provided in the initial ordering task.

Event Timeline

Dzahn subscribed.

@akosiaris I made T364656 and suggest seeing that either as a parent task or simply merging this in there.

Thanks @Dzahn. It's fine as a parent task, thanks for adding it. T364416 already says bullseye, for what it's worth. Adding @jijiki for her information.

Change #1050345 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Add deploy1003 to site.pp

https://gerrit.wikimedia.org/r/1050345

Change #1050345 merged by Alexandros Kosiaris:

[operations/puppet@production] Add deploy1003 to site.pp

https://gerrit.wikimedia.org/r/1050345

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bookworm completed:

  • deploy1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406281315_akosiaris_1591191_deploy1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye completed:

  • deploy1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406281356_akosiaris_1599616_deploy1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I've applied the role and am now working through packaging python3-imagecatalog for bullseye.

  • python3-imagecatalog published and gerrit repo updated
  • php72 component made conditional

Noting that having deploy1003.eqiad.wmnet in deploy1002:/etc/dsh/group/scap-masters before it is fully set up is causing problems for scap deployments. For example, I got the following when trying to update scap today:

$ scap install-world
...
16:33:58 Syncing masters
16:33:59 ['/usr/bin/rsync', '--archive', '--delay-updates', '--delete', '--delete-delay', '--compress', '--new-compress', '--exclude=*.swp', '--exclude=**/__pycache__', 'deploy1002.eqiad.wmnet::scap-install-staging', '/var/lib/scap'] (ran as scap@deploy1003.eqiad.wmnet) returned [1]: This account is currently not available.

16:34:00 scap-sync-to-masters: 100% (in-flight: 0; ok: 1; fail: 1; left: 0)
16:34:00 1 masters failed to sync scap installation
Aborting: Install failed

And a backport deployer got bogged down by it:

$ scap backport <xxxxx>
....
20:04:09 Started sync-masters
20:04:19 sync-masters:  50% (in-flight: 1; ok: 1; fail: 0; left: 0) /
20:22:07 sync-masters: 100% (in-flight: 0; ok: 2; fail: 0; left: 0)
20:22:07 Finished sync-masters (duration: 17m 57s)   <---- Too long!
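
The progress lines in the two excerpts above share a fixed shape (timestamp, stage name, percentage, then in-flight/ok/fail/left counters). A minimal sketch of pulling those counters out of a captured log, assuming that shape holds; the pattern is inferred from the lines quoted here and is not taken from scap itself, and `parse_progress` is a hypothetical helper name:

```python
import re
from typing import Optional

# Pattern inferred from the quoted output only (an assumption, not scap's
# documented log format).
PROGRESS_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) (?P<stage>[\w-]+):\s+(?P<percent>\d+)% "
    r"\(in-flight: (?P<in_flight>\d+); ok: (?P<ok>\d+); "
    r"fail: (?P<fail>\d+); left: (?P<left>\d+)\)"
)

COUNTERS = ("percent", "in_flight", "ok", "fail", "left")


def parse_progress(line: str) -> Optional[dict]:
    """Return the counters from one progress line, or None if it is not one."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"time": fields["time"], "stage": fields["stage"]}
    # Convert the numeric counters from strings to integers.
    result.update({name: int(fields[name]) for name in COUNTERS})
    return result


if __name__ == "__main__":
    sample = "16:34:00 scap-sync-to-masters: 100% (in-flight: 0; ok: 1; fail: 1; left: 0)"
    progress = parse_progress(sample)
    if progress is not None and progress["fail"] > 0:
        print(f"{progress['stage']}: {progress['fail']} target(s) failed")
```

A check like this only flags outright failures such as the first excerpt; the slow but successful sync in the second excerpt would need a comparison of the timestamps instead.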

Change #1051392 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] deploy1003: Comment them out from scap_masters

https://gerrit.wikimedia.org/r/1051392

Change #1051392 merged by Alexandros Kosiaris:

[operations/puppet@production] deploy1003: Comment them out from scap_masters

https://gerrit.wikimedia.org/r/1051392

Apologies, I failed to anticipate that consequence. I've merged a change to remove deploy1003 from the list of scap masters.

Change #1051772 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] deployment::rsync: Remove long absented resources

https://gerrit.wikimedia.org/r/1051772

Change #1051782 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] WIP deployment::rsync: Temporarily disable stunnel

https://gerrit.wikimedia.org/r/1051782

Change #1052111 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] deployment::rsync: Add support for PKI

https://gerrit.wikimedia.org/r/1052111

Change #1051772 merged by Alexandros Kosiaris:

[operations/puppet@production] deployment::rsync: Remove long absented resources

https://gerrit.wikimedia.org/r/1051772

Change #1051782 abandoned by Alexandros Kosiaris:

[operations/puppet@production] WIP deployment::rsync: Temporarily disable stunnel

Reason:

Better to move with PKI provisioned certs

https://gerrit.wikimedia.org/r/1051782

Change #1052111 merged by Alexandros Kosiaris:

[operations/puppet@production] deployment::rsync: Add support for PKI

https://gerrit.wikimedia.org/r/1052111

Change #1053718 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] deploy1003: Undo the puppet 7 force

https://gerrit.wikimedia.org/r/1053718

Change #1053718 merged by Alexandros Kosiaris:

[operations/puppet@production] deploy1003: Undo the puppet 7 force

https://gerrit.wikimedia.org/r/1053718

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye completed:

  • deploy1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407111551_akosiaris_4099339_deploy1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1056144 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] imagecatalog: Vary gunicorn package on Debian version

https://gerrit.wikimedia.org/r/1056144

Change #1056144 merged by Alexandros Kosiaris:

[operations/puppet@production] imagecatalog: Vary gunicorn package on Debian version

https://gerrit.wikimedia.org/r/1056144

Change #1056156 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] imagecatalog: Force uid/gid for imagecatalog user

https://gerrit.wikimedia.org/r/1056156

Change #1056156 merged by Alexandros Kosiaris:

[operations/puppet@production] imagecatalog: Force uid/gid for imagecatalog user

https://gerrit.wikimedia.org/r/1056156

Change #1056157 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] imagecatalog: Actually comment about the uid/gid

https://gerrit.wikimedia.org/r/1056157

Change #1056157 merged by Alexandros Kosiaris:

[operations/puppet@production] imagecatalog: Actually comment about the uid/gid

https://gerrit.wikimedia.org/r/1056157

Change #1056158 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] imagecatalog: Remove require for data directory

https://gerrit.wikimedia.org/r/1056158

Change #1056158 merged by Alexandros Kosiaris:

[operations/puppet@production] imagecatalog: Remove require for data directory

https://gerrit.wikimedia.org/r/1056158

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye executed with errors:

  • deploy1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407231406_akosiaris_2085260_deploy1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" deploy1003.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye completed:

  • deploy1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407241033_akosiaris_2246523_deploy1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Just armed keyholder; everything looks OK right now. I'll send a notification to wikitech-l and engineering in Slack about the deployment server move. It's not much different from what we do for the switchover.

Change #1056871 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Revert "deploy1003: Comment them out from scap_masters"

https://gerrit.wikimedia.org/r/1056871

Change #1056871 merged by Alexandros Kosiaris:

[operations/puppet@production] Revert "deploy1003: Comment them out from scap_masters"

https://gerrit.wikimedia.org/r/1056871

Change #1056878 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] deployment: Switch master deployment host to deploy1003

https://gerrit.wikimedia.org/r/1056878

The move to this server from deploy1002 is scheduled for Monday 2024-07-29 09:00 UTC.

I've also performed a NOOP deployment from deploy1003 today. It was slow (20 minutes) because the images had to be built, but otherwise OK.

Change #1056878 merged by Alexandros Kosiaris:

[operations/puppet@production] deployment: Switch master deployment host to deploy1003

https://gerrit.wikimedia.org/r/1056878

Change #1057833 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Switch deployment.eqiad.wmnet to deploy1003

https://gerrit.wikimedia.org/r/1057833

Change #1057833 merged by Alexandros Kosiaris:

[operations/dns@master] Switch deployment.eqiad.wmnet to deploy1003

https://gerrit.wikimedia.org/r/1057833

Hi, I failed to ssh into deployment.eqiad.wmnet. The message I got is "deployment.eqiad.wmnet: Permission denied (publickey)". I was able to ssh in a couple of months ago. Is this related to the deployment? Is there anything I can update on my end to fix it?

Hello, please open a new ticket and tag it SRE and SRE-Access-Requests to debug your issue. As far as I can see, you are not part of any group that has access to the deployment server. If this is a mistake and you should have access, please add the required information for an access request so the clinic duty SRE can help.

Because users have been asking about the changed host key, I dumped the fingerprints for deploy1003 on https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1003.eqiad.wmnet

deploy1003 didn't have a page like the one at https://wikitech.wikimedia.org/wiki/Deploy1002, which just redirects now.
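
For anyone comparing a host key against a published fingerprint, the SHA256 form that OpenSSH prints is the unpadded base64 of the SHA-256 digest of the decoded key blob. A minimal sketch of that computation follows; `sha256_fingerprint` is a hypothetical helper name, and the key in the usage example is a placeholder built on the spot, not a real key for deploy1003:

```python
import base64
import hashlib


def sha256_fingerprint(public_key_line: str) -> str:
    """Compute an OpenSSH-style SHA256 fingerprint from one public key line.

    Expects the usual '<key-type> <base64-blob> [comment]' layout.
    """
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key-type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1], validate=True)
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 with the '=' padding stripped.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # Placeholder blob for illustration only; substitute a real key line.
    placeholder = base64.b64encode(b"placeholder key blob").decode("ascii")
    print(sha256_fingerprint(f"ssh-ed25519 {placeholder} example-comment"))
```

The result can then be compared by eye with the value listed on the fingerprints page before accepting the new host key.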

Hi @Dzahn, thanks for the reply. Is it something I need to update on my end? If so, could you provide more instructions on how to do that?

@jwang My comment was a general comment about the new deployment server and the replacement process.

It shouldn't be related to your access question. It wouldn't manifest as "Permission denied (publickey)".

I can confirm what Clement said above. It seems like you simply aren't in the group that gets access to the deployment server. You have other groups like analytics-privatedata-users and analytics-product-users but not the one for deployments.

So if you are sure you used to have that access at some point in the past then we need to dig deeper into when and where you were removed.

But I also don't see a home directory for you there and these normally stay around forever, even for former users. So that would be strange even if you had access in the past.

The one thing that you can do to get this fixed is to create a new request and ask to be added to the deployment group. (https://wikitech.wikimedia.org/wiki/SRE/Production_access#Filing_the_request)
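
The membership check described above can be sketched with the standard Unix account databases. This is only an illustration for a host you can already log in to: the group name "deployment" is taken from the wording in this thread, the user name is a placeholder, and `user_in_group` is a hypothetical helper. Where the groups are actually defined is not shown here, so treat the access request as the authoritative route.

```python
import grp
import pwd


def user_in_group(user: str, group: str) -> bool:
    """Return True if `user` is in `group` according to the local databases."""
    try:
        group_entry = grp.getgrnam(group)
    except KeyError:
        return False  # the group does not exist on this host
    if user in group_entry.gr_mem:
        return True  # listed as a supplementary member
    try:
        # A user's primary group is not listed in gr_mem, so check it too.
        return pwd.getpwnam(user).pw_gid == group_entry.gr_gid
    except KeyError:
        return False  # the user does not exist on this host


if __name__ == "__main__":
    # Both names are placeholders (assumptions), not verified account data.
    print(user_in_group("example-user", "deployment"))
```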

Thanks @Dzahn, I guess I mixed it up with the other servers. I will create a request for the deployment group.