
Upgrade Observability mwlog hosts to trixie
Closed, Resolved (Public)

Description

These hosts currently run bullseye and need to be upgraded to trixie:

  • mwlog1002 (superseded by mwlog1003)
  • mwlog2002 (superseded by mwlog2003)
  • turn down mwlog[12]002
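For readers unfamiliar with the `mwlog[12]002` shorthand used throughout this task, here is a minimal Python sketch of how it expands. It assumes the brackets hold a set of single characters (not a range such as `[1-3]`); `expand_hosts` is a hypothetical helper written for illustration, not part of any tooling mentioned here.

```python
import itertools
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand single-character bracket sets, e.g. 'mwlog[12]002'."""
    # re.split with a capture group alternates literal text (even indexes)
    # and bracket contents (odd indexes).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    # A literal contributes itself; a bracket set contributes one option
    # per character it contains.
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_hosts("mwlog[12]002"))  # ['mwlog1002', 'mwlog2002']
print(expand_hosts("mwlog[12]003"))  # ['mwlog1003', 'mwlog2003']
```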

Event Timeline

herron subscribed.

I can grab this. We recently racked the mwlog[12]003 hardware, so we can tackle the hardware refresh and the trixie upgrade at the same time.

Change #1247106 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mwlog[12]003: apply role

https://gerrit.wikimedia.org/r/1247106

Change #1247106 merged by Herron:

[operations/puppet@production] mwlog[12]003: apply role

https://gerrit.wikimedia.org/r/1247106

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host mwlog1003.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host mwlog2003.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host mwlog1003.eqiad.wmnet with OS trixie completed:

  • mwlog1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603022323_herron_2193822_mwlog1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host mwlog2003.codfw.wmnet with OS trixie executed with errors:

  • mwlog2003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console mwlog2003.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1003 for host mwlog2003.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1003 for host mwlog2003.codfw.wmnet with OS trixie completed:

  • mwlog2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603030119_herron_2319883_mwlog2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
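Comparing the failed and the successful mwlog2003 runs above gives a rough idea of where the first attempt stopped. The sketch below is only an illustration built from the step names in this timeline (`first_missing_step` is a hypothetical helper, and the passing list is truncated to its first eight steps); the cookbook logs remain the place to look for the actual cause.

```python
from typing import Optional


def first_missing_step(completed: list[str], reference: list[str]) -> Optional[str]:
    """Return the first step of `reference` that `completed` never reached."""
    done = set(completed)
    for step in reference:
        if step not in done:
            return step
    return None


# Steps reported by the failed mwlog2003 run, copied from the timeline.
FAILED_RUN = [
    "Downtimed on Icinga/Alertmanager",
    "Disabled Puppet",
    "Removed from Puppet and PuppetDB if present and deleted any certificates",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Checked BIOS boot parameters are back to normal",
]

# First steps reported by the successful retry (list truncated here).
PASSING_RUN_PREFIX = [
    "Removed from Puppet and PuppetDB if present and deleted any certificates",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Checked BIOS boot parameters are back to normal",
    "Host up (new fresh trixie OS)",
    "Generated Puppet certificate",
]

print(first_missing_step(FAILED_RUN, PASSING_RUN_PREFIX))
# Host up (new fresh trixie OS)
```

This suggests the first attempt got through the installer boot but never reported the freshly installed OS as up.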

This task doesn't require us to act on T333731, but it does bring it back into focus. Trixie has a slightly more modern version of rsyslog, so we can continue on while on trixie, but I assume that this is what held back the mwlog upgrades as part of T353912.

Change #1247630 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mwlog: add trixie hosts to udp tee

https://gerrit.wikimedia.org/r/1247630

Change #1247630 merged by Herron:

[operations/puppet@production] mwlog: add trixie hosts to udp tee

https://gerrit.wikimedia.org/r/1247630
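As background on what a "udp tee" does during a migration like this, here is a generic Python sketch: every datagram received is copied to each destination, so old and new hosts see the same stream. This is not the puppetized implementation (the task does not show it); `tee_datagram`, the loopback sockets and the sample payload are all assumptions made for the demo.

```python
import socket


def tee_datagram(listener: socket.socket,
                 destinations: list[tuple[str, int]],
                 bufsize: int = 65535) -> bytes:
    """Receive one datagram and send a copy to every destination."""
    data, _sender = listener.recvfrom(bufsize)
    for dest in destinations:
        # UDP is fire-and-forget: a successful sendto() says nothing
        # about whether the destination actually received the data.
        listener.sendto(data, dest)
    return data


def udp_socket_on_loopback() -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    sock.settimeout(2.0)
    return sock


def demo() -> list[bytes]:
    # Two loopback sockets stand in for an old and a new log host.
    tee = udp_socket_on_loopback()
    old_host = udp_socket_on_loopback()
    new_host = udp_socket_on_loopback()
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sender.sendto(b"example log line\n", tee.getsockname())
        tee_datagram(tee, [old_host.getsockname(), new_host.getsockname()])
        return [old_host.recvfrom(65535)[0], new_host.recvfrom(65535)[0]]
    finally:
        for s in (tee, old_host, new_host, sender):
            s.close()


if __name__ == "__main__":
    print(demo())
```

Feeding both generations of hosts in parallel lets the new ones be checked before anything is removed from the stream.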

Change #1248564 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mwlog: copy archives to trixie hosts

https://gerrit.wikimedia.org/r/1248564

Change #1248564 merged by Herron:

[operations/puppet@production] mwlog: copy archives to trixie hosts

https://gerrit.wikimedia.org/r/1248564

Change #1249332 had a related patch set uploaded (by Herron; author: Herron):

[operations/mediawiki-config@master] udp2log: switch to new hosts

https://gerrit.wikimedia.org/r/1249332

> This task doesn't require us to act on T333731, but it does bring it back into focus. Trixie has a slightly more modern version of rsyslog, so we can continue on while on trixie, but I assume that this is what held back the mwlog upgrades as part of T353912.

Good call. I added some thoughts in T333731: Investigate pre-kafka log agent replacement for rsyslog.

IIRC, T353912: Observability Bookworm upgrades got deferred for non-technical reasons.

Change #1249332 merged by jenkins-bot:

[operations/mediawiki-config@master] udp2log: switch to new hosts

https://gerrit.wikimedia.org/r/1249332

Mentioned in SAL (#wikimedia-operations) [2026-03-09T17:23:35Z] <herron@deploy2002> Started scap sync-world: Backport for [[gerrit:1249332|udp2log: switch to new hosts (T417002)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-09T17:25:23Z] <herron@deploy2002> herron: Backport for [[gerrit:1249332|udp2log: switch to new hosts (T417002)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Change #1249362 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] networkpolicy: allow udp 8420 towards new mwlog hosts

https://gerrit.wikimedia.org/r/1249362

Change #1249362 merged by jenkins-bot:

[operations/deployment-charts@master] networkpolicy: allow udp 8420 towards new mwlog hosts

https://gerrit.wikimedia.org/r/1249362

Change #1250014 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mwlog: remove notion of primary/secondary

https://gerrit.wikimedia.org/r/1250014

Change #1250606 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mwlog: remove mwlog[12]002 from udp tee stream

https://gerrit.wikimedia.org/r/1250606

Change #1250014 merged by Herron:

[operations/puppet@production] mwlog: remove notion of primary/secondary

https://gerrit.wikimedia.org/r/1250014

Change #1250606 merged by Herron:

[operations/puppet@production] mwlog: remove mwlog[12]002 from udp tee stream

https://gerrit.wikimedia.org/r/1250606

herron triaged this task as Medium priority. (Mar 11 2026, 2:51 PM)
herron updated the task description.
herron updated the task description.
brennen subscribed.

Noting this one for fellow deployers as I was just surprised to find logs not being updated (and thus logspam-watch not working) on mwlog[12]002. A general announcement would probably be a good idea.

Thanks! I've updated the docs at https://wikitech.wikimedia.org/wiki/Wikimedia_binaries#mwlog_host from "probably mwlog[12]002" to "probably mwlog[12]003".

The old hosts will be turned down soon to address the idle logs issue.
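In the meantime, a quick way to tell whether the host you are on is still receiving logs is to look at file modification times. The Python sketch below is an assumption-laden illustration: `stale_logs` is a hypothetical helper, the `*.log` glob and the 10-minute threshold are arbitrary, and the demo uses a temporary directory rather than any real log path.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def stale_logs(directory: Path, max_age_seconds: float,
               now: Optional[float] = None) -> list[str]:
    """Return names of *.log files not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(directory.glob("*.log")):
        if now - path.stat().st_mtime > max_age_seconds:
            stale.append(path.name)
    return stale


def demo() -> list[str]:
    # Build a throwaway directory with one fresh and one idle log file.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fresh, old = root / "fresh.log", root / "old.log"
        fresh.write_text("recent entry\n")
        old.write_text("last entry before the switch\n")
        an_hour_ago = time.time() - 3600
        os.utime(old, (an_hour_ago, an_hour_ago))  # backdate the mtime
        return stale_logs(root, max_age_seconds=600)


if __name__ == "__main__":
    print(demo())  # ['old.log']
```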

herron updated the task description.