upgrade people servers to trixie
Closed, Resolved (Public)

Description

The people.wikimedia.org backends have traditionally been good candidates to upgrade early when a new distro version is released.

Check whether the install of trixie goes well.

They are (mostly) just a simple Apache setup and not critical, so they are a good way to see whether any adjustments are needed before upgrading more important services.

Event Timeline

Change #1180977 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add people1005 with insetup role

https://gerrit.wikimedia.org/r/1180977

Change #1180977 merged by Dzahn:

[operations/puppet@production] site: add people1005 with insetup role

https://gerrit.wikimedia.org/r/1180977

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host people1005.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host people1005.eqiad.wmnet with OS trixie completed:

  • people1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508212153_dzahn_2239507_people1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1180990 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add peopleweb role to people1005

https://gerrit.wikimedia.org/r/1180990

Change #1180990 merged by Dzahn:

[operations/puppet@production] site: add peopleweb role to people1005

https://gerrit.wikimedia.org/r/1180990

Dzahn changed the task status from Open to Stalled. Aug 22 2025, 5:18 PM

Currently blocked on T402668.

Mentioned in SAL (#wikimedia-operations) [2025-08-22T17:32:42Z] <dzahn@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on people1005.eqiad.wmnet with reason: T402596

Change #1181194 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add regex to include people2004 with insetup role

https://gerrit.wikimedia.org/r/1181194

Change #1181194 merged by Dzahn:

[operations/puppet@production] site: add regex to include people2004 with insetup role

https://gerrit.wikimedia.org/r/1181194
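
For illustration only, here is a minimal sketch of how a node regex can select both new hosts while excluding other people hosts. The pattern is hypothetical, since the actual regex from the change above is not quoted in this task, and `matches_node` is a helper name invented for this example.

```python
import re

# Hypothetical pattern: the real regex in site.pp is not shown in this task.
# It pins each new host to its own site (people1005 in eqiad, people2004 in codfw).
NODE_PATTERN = re.compile(r"^people(1005\.eqiad|2004\.codfw)\.wmnet$")


def matches_node(fqdn: str) -> bool:
    """Return True if the FQDN is selected by the hypothetical node regex."""
    return NODE_PATTERN.search(fqdn) is not None


if __name__ == "__main__":
    # people9999 is a made-up hostname used only as a non-matching example.
    for host in (
        "people1005.eqiad.wmnet",
        "people2004.codfw.wmnet",
        "people9999.eqiad.wmnet",
    ):
        print(host, matches_node(host))
```

Anchoring the pattern at both ends avoids accidentally matching hosts whose names merely contain one of the new hostnames.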

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host people2004.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host people2004.codfw.wmnet with OS trixie completed:

  • people2004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508222108_dzahn_3903597_people2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-08-26T17:44:39Z] <dzahn@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on people1005.eqiad.wmnet with reason: T402596

Change #1184553 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add peopleweb role to new peopleweb hosts again

https://gerrit.wikimedia.org/r/1184553

Change #1184553 merged by Dzahn:

[operations/puppet@production] site: add peopleweb role to new peopleweb hosts again

https://gerrit.wikimedia.org/r/1184553

Dzahn changed the task status from Stalled to In Progress. Sep 3 2025, 4:09 PM

We have envoy now and, after some debugging in the linked subtask, both envoy and apache are working.

Change #1184573 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: add additional rsync destination hosts

https://gerrit.wikimedia.org/r/1184573

Change #1184573 merged by Dzahn:

[operations/puppet@production] peopleweb: allow multiple rsync destination hosts

https://gerrit.wikimedia.org/r/1184573

LSobanski triaged this task as Low priority.
LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

Change #1186509 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Ignore backup failures from people1005 & people2004

https://gerrit.wikimedia.org/r/1186509

Dzahn changed the task status from In Progress to Stalled. Sep 9 2025, 5:57 PM

Putting this on pause because backups were failing and that turned out to be T404114.

Change #1186509 merged by Dzahn:

[operations/puppet@production] bacula: Ignore backup failures from people1005 & people2004

https://gerrit.wikimedia.org/r/1186509

Dzahn changed the task status from Stalled to Open. Sep 11 2025, 4:29 PM

Unstalled; the bacula issue was resolved.

Change #1187884 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: make people2004 the new rsync source

https://gerrit.wikimedia.org/r/1187884

Change #1187885 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch people service aliases in eqiad and codfw to new trixie hosts

https://gerrit.wikimedia.org/r/1187885

Change #1187885 merged by Dzahn:

[operations/dns@master] switch people service aliases in eqiad and codfw to new trixie hosts

https://gerrit.wikimedia.org/r/1187885

Change #1187884 merged by Dzahn:

[operations/puppet@production] peopleweb: make people2004 the new rsync source

https://gerrit.wikimedia.org/r/1187884

Rsynced data again, checked httpbb tests, and monitoring is fixed.

Sent mail to sre-at-large to announce this.