Migrate Mailman/lists to Bullseye/Bookworm
Open, Medium, Public

Description

lists1001 is still on Buster. Many of the components comprising the Mailman setup are actually as recent as Bullseye (or even more recent/patched), so these need a closer look if we carry local patches etc. But in general from the Mailman perspective we're already quite close to Bullseye:

Package             Version on lists1001    Version in Bullseye
django-mailman3     1.3.5-2~bpo10+1         1.3.5-2
mailman-hyperkitty  1.1.0-10~bpo10+1        1.1.0-10
mailman-suite       0+20200530-2~bpo10+1    0+20200530-2
mailman3            3.3.3-1~bpo10+6         3.3.3-1
mailmanclient       3.3.2-1~bpo10+2         3.3.2-1
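
The `~bpo10+N` suffix marks a buster-backports rebuild of the Debian version that precedes it. A minimal sketch of the comparison behind the table, assuming plain POSIX sh; `base_version` is a hypothetical helper name, and the version strings are the ones listed above:

```shell
#!/bin/sh
# Strip a Debian backports suffix ("~bpo<release>+<n>") from a version string.
# base_version is a hypothetical helper, not part of any packaging tool.
base_version() {
    printf '%s\n' "${1%%"~bpo"*}"
}

# Columns: package, version on lists1001, version in Bullseye (from the table).
while read -r pkg installed bullseye; do
    if [ "$(base_version "$installed")" = "$bullseye" ]; then
        echo "$pkg: backport of the Bullseye version"
    else
        echo "$pkg: differs from Bullseye"
    fi
done <<'EOF'
django-mailman3 1.3.5-2~bpo10+1 1.3.5-2
mailman-hyperkitty 1.1.0-10~bpo10+1 1.1.0-10
mailman-suite 0+20200530-2~bpo10+1 0+20200530-2
mailman3 3.3.3-1~bpo10+6 3.3.3-1
mailmanclient 3.3.2-1~bpo10+2 3.3.2-1
EOF
```

This only compares version strings; it says nothing about local patches carried in the backported builds, which still need the closer look mentioned above.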

There are various older considerations on use of public IPs covered at (https://phabricator.wikimedia.org/T278495), but it's probably useful to first upgrade lists1001 in place before moving to a new setup.

Details

Other Assignee
Arnoldokoth
Repo               Branch      Lines +/-
operations/puppet  production  +4 -16
operations/puppet  production  +20 -5
operations/puppet  production  +1 -1
operations/puppet  production  +8 -8
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +4 -2
operations/puppet  production  +1 -3
operations/puppet  production  +4 -4
operations/puppet  production  +5 -1
operations/puppet  production  +42 -1
operations/puppet  production  +2 -2
operations/puppet  production  +2 -0
operations/puppet  production  +1 -1
operations/puppet  production  +85 -101
operations/puppet  production  +3 -3
operations/puppet  production  +7 -6
operations/puppet  production  +3 -0
operations/puppet  production  +37 -23
operations/puppet  production  +8 -21
operations/puppet  production  +345 -0
operations/puppet  production  +35 -2
operations/puppet  production  +19 -8
operations/puppet  production  +9 -7
operations/puppet  production  +3 -2
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
operations/puppet  production  +5 -0
Event Timeline

Change 902501 merged by JHathaway:

[operations/puppet@production] apache2: Use systemd provider

https://gerrit.wikimedia.org/r/902501

Change 902782 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] mtail: Update defaults for bookworm

https://gerrit.wikimedia.org/r/902782

Change 902782 merged by JHathaway:

[operations/puppet@production] mtail: Update defaults for bookworm

https://gerrit.wikimedia.org/r/902782

@Legoktm and @Ladsgroup I have set up a new host, lists1003.wikimedia.org, on bookworm. All the software is installed and most of the bookworm issues have been sorted out. However, mailman3 is not starting, because it does not yet have any DB grants. My thought was to grant read-only rights and see if that is sufficient?

Change 902808 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] Add an in place Debian upgrade script

https://gerrit.wikimedia.org/r/902808

I'll try to take a look at the grants (it's a bit unusual for me given that's behind haproxy) but while we are at it, can you increase number of CPUs? two CPUs is way too few and that has been biting us in certain parts of our back multiple times.

thanks @Ladsgroup, happy to increase the cpu count, any sense of what a good number would be?

I'd say let's double to four (current prod VMs have two) and we can easily increase further as needed.

Yeah, I was about to say from the application point of view, the more the better, like why not 400? But I don't know the limitations of the infra so I can't say where to stop. We probably should eventually move it to bare metal, but before that someone needs to actually take ownership of it.

I bumped the CPU count to four and as @MoritzMuehlenhoff mentioned we can always bump higher if the need arises.

Change 902808 merged by JHathaway:

[operations/puppet@production] Add an in place Debian upgrade script

https://gerrit.wikimedia.org/r/902808

Change 910598 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Add lists1003 grants for mailman dbs

https://gerrit.wikimedia.org/r/910598

Change 911847 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] httpd: always use systemd

https://gerrit.wikimedia.org/r/911847

Change 911847 merged by Jbond:

[operations/puppet@production] httpd: always use systemd

https://gerrit.wikimedia.org/r/911847

Dzahn subscribed.

T336555 has been opened about alerts related to lists1003. Seems like expected though since this is still WIP.

Change 927684 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] lists: Use stock mailman3 on bookworm

https://gerrit.wikimedia.org/r/927684

Change 927684 merged by JHathaway:

[operations/puppet@production] lists: Use stock mailman3 on bookworm

https://gerrit.wikimedia.org/r/927684

Change 910598 abandoned by Ladsgroup:

[operations/puppet@production] mariadb: Add lists1003 grants for mailman dbs

Reason:

https://gerrit.wikimedia.org/r/910598

Updating the host ownership in the Puppet role should also be part of this task.

Change #1024655 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] mailman: Take ownership of lists hosts

https://gerrit.wikimedia.org/r/1024655

eoghan updated Other Assignee, added: Arnoldokoth.
eoghan added a subscriber: jhathaway.

Change #1024655 merged by EoghanGaffney:

[operations/puppet@production] mailman: Change ownership of lists hosts to sre-collab and rename

https://gerrit.wikimedia.org/r/1024655

Change #1025741 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] WIP: lists: Add lists role and public IPs to list2001

https://gerrit.wikimedia.org/r/1025741

Change #1026157 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Add collaboration services as owner

https://gerrit.wikimedia.org/r/1026157

Change #1026157 merged by EoghanGaffney:

[operations/puppet@production] lists: Add collaboration services as owner

https://gerrit.wikimedia.org/r/1026157

Change #1025741 merged by EoghanGaffney:

[operations/puppet@production] lists: Add lists role to list2001

https://gerrit.wikimedia.org/r/1025741

Change #1035777 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Don't try to remove the mtail user when monitoring is absent

https://gerrit.wikimedia.org/r/1035777

Change #1035777 merged by EoghanGaffney:

[operations/puppet@production] lists: Don't try to remove the mtail user when monitoring is absent

https://gerrit.wikimedia.org/r/1035777

Change #1035785 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Add lists2001/lists1004 as allowed hosts for acmechief

https://gerrit.wikimedia.org/r/1035785

Change #1035785 merged by EoghanGaffney:

[operations/puppet@production] lists: Add lists2001/lists1004 as allowed hosts for acmechief

https://gerrit.wikimedia.org/r/1035785

Change #1036610 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Migrate mailman VIPs from lists1001 -> lists1004

https://gerrit.wikimedia.org/r/1036610

Change #1036686 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Update the quickdatacopy to use /var/lib/mailman3

https://gerrit.wikimedia.org/r/1036686

Mentioned in SAL (#wikimedia-operations) [2024-06-04T09:01:51Z] <moritzm> imported python3-xapian-haystack 2.1.1-1+deb12u1 to bookworm-wikimedia (already lined up for the next Bookworm point release to address https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1066136 and needed for the update of the Mailman servers T331706

Change #1036686 merged by EoghanGaffney:

[operations/puppet@production] lists: Update the quickdatacopy to use /var/lib/mailman3

https://gerrit.wikimedia.org/r/1036686

The rough outline for migration is:

1: stop mail arriving inbound, wait for queues to clear out
2: migrate data, VIPs and service from old host to new host
3: run the required upgrade steps
4: test web UI on new host
5: allow mail to arrive inbound

More detailed step-by-step plan for migrating from the old hosts to the new host (lists1001 -> lists1004):

Prep:

  • Merge puppet change to block incoming mail on lists1001 and lists1004
  • Ensure the queue is empty on lists1001 (lists1001: sudo find /var/lib/mailman3/queue/{in,out} | wc -l)
  • Stop mailman on lists1001 (lists1001: sudo systemctl stop mailman3; sudo systemctl stop mailman3-web)
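
One detail on the queue check in the Prep steps: `find` also prints the `in` and `out` directories themselves, so an empty queue reports 2 rather than 0. A minimal sketch that counts only files, assuming POSIX sh; `queue_depth` is a hypothetical helper and the demo runs against a throwaway directory, not the real queue:

```shell
#!/bin/sh
# Count queued message files under a Mailman queue root.
# -type f skips the in/ and out/ directories that a bare find would count.
# queue_depth is a hypothetical helper name.
queue_depth() {
    find "$1/in" "$1/out" -type f | wc -l | tr -d ' '
}

# Demo against a temporary tree standing in for /var/lib/mailman3/queue.
demo=$(mktemp -d)
mkdir -p "$demo/in" "$demo/out"
echo "empty queue: $(queue_depth "$demo")"
touch "$demo/in/example.pck"
echo "after one message: $(queue_depth "$demo")"
rm -rf "$demo"
```

On the real host this would be called as `sudo sh -c` around `queue_depth /var/lib/mailman3/queue`, or the `-type f` flag could simply be added to the one-liner in the plan.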

Migrate:

  • Ensure data is synced from lists1001 to lists1004/lists2001 (sudo /usr/local/sbin/sync-var-lib-mailman)
  • Merge CR migrating VIPs from lists1001, and switching primary host to lists1004 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036610)
  • Run puppet agent on lists1001, ensure VIPs are removed and exim4 config does not contain the lists VIPs for routing mail (lists1001: sudo grep 208.80.154.21 /etc/exim4/exim4.conf)
  • Run puppet agent on lists1004, ensure VIPs are added and exim4 config does contain the lists VIPs (lists1004: sudo grep 208.80.154.21 /etc/exim4/exim4.conf)
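
The two grep checks in the Migrate steps can be wrapped so that each host reports pass or fail against the state it is expected to be in. A minimal sketch, assuming POSIX sh and a grep that supports `-w` (GNU, BSD and busybox do); `has_vip` and `check_vip` are hypothetical helpers, and the demo config contents are made up:

```shell
#!/bin/sh
# Succeed if the file mentions the address as a whole, literal token.
# -F stops the dots acting as regex wildcards; -w avoids matching e.g. ".210".
has_vip() {
    grep -qwF -- "$1" "$2"
}

# $1 = host label, $2 = config file, $3 = VIP, $4 = "present" or "absent"
check_vip() {
    if has_vip "$3" "$2"; then state=present; else state=absent; fi
    if [ "$state" = "$4" ]; then
        echo "OK: $1 VIP $state"
    else
        echo "FAIL: $1 VIP $state, expected $4"
        return 1
    fi
}

# Demo with invented config snippets standing in for /etc/exim4/exim4.conf.
tmp=$(mktemp -d)
printf 'local_interfaces = 10.0.0.1\n' > "$tmp/old.conf"
printf 'local_interfaces = 208.80.154.21\n' > "$tmp/new.conf"
check_vip lists1001 "$tmp/old.conf" 208.80.154.21 absent
check_vip lists1004 "$tmp/new.conf" 208.80.154.21 present
rm -rf "$tmp"
```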

Post-upgrade:

  • Run the following post-upgrade steps on the new host, lists1004:
    • mailman-web migrate
    • mailman-web compress
    • mailman-web collectstatic
    • mailman-web compilemessages
    • mailman-web rebuild_index
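
The post-upgrade steps are order-sensitive, so a wrapper that stops at the first failure may be useful. A minimal sketch, assuming POSIX sh; `run_post_upgrade` is a hypothetical helper, the subcommand list is copied from the steps above, and any flags needed to suppress interactive prompts are deliberately left out:

```shell
#!/bin/sh
# Run each mailman-web subcommand in order and stop at the first failure.
# $1 is the command to invoke; on the real host that would be mailman-web.
# run_post_upgrade is a hypothetical helper name.
run_post_upgrade() {
    for step in migrate compress collectstatic compilemessages rebuild_index; do
        echo "==> $1 $step"
        "$1" "$step" || { echo "step failed: $step" >&2; return 1; }
    done
}

# Dry run: substitute echo for mailman-web to print the sequence only.
run_post_upgrade echo
```

Since rebuild_index is last in the list, it could also be split out and run after mail is flowing again, if testing shows that is safe.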

Restore:

  • Start mailman-web on lists1004 and verify (lists1004: sudo systemctl start mailman-web)
  • Test mail delivery locally
  • Merge puppet change to unblock incoming mail on lists1004
  • Re-enable puppet on all hosts (cumin: sudo cumin 'A:lists' 'sudo puppet agent --enable')

Rolling back:

We can roll back at any point before mail is allowed to arrive on the new host by reverting the puppet change that migrates the VIPs and service. After that point, some mail may have been sent to exim but not yet delivered, and we can deal with that as it comes.

Overall looks good. Just noting that rebuilding the index will take a very long time, which can make the downtime considerably longer. I wonder if we can just rsync the indexes and avoid that? We can probably also run the index rebuild after the migration (and tell people that search won't work for a while).

Change #1041232 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Remove quickdatacopy and use our own rsyncd and systemd timer

https://gerrit.wikimedia.org/r/1041232

Change #1041232 merged by EoghanGaffney:

[operations/puppet@production] lists: Remove quickdatacopy and use our own rsyncd and systemd timer

https://gerrit.wikimedia.org/r/1041232

We can rsync the indices but I'm not sure they'll work -- the upgrade docs call out specifically that indices need to be rebuilt. I think you're correct though that we can start allowing mail to flow while letting the index rebuild continue to run in the background. That said, the docs mention python2 to python3 compatibility, so we should definitely test this before we kick off a big rebuild.

I've created a sub-task for the migration itself so users and community members can follow it more easily, rather than trawling through comments and patch notifications. It's been tagged with User-notice so it ends up in Tech News. The downtime will be on Tuesday 18th from 10-12 UTC.

https://phabricator.wikimedia.org/T367521

Change #1043799 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Block incoming email on lists hosts during mailman migration

https://gerrit.wikimedia.org/r/1043799

Change #1046785 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Switch DB firewall rules to use primary host variable

https://gerrit.wikimedia.org/r/1046785

Change #1046786 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Allow mail to be received on lists1004

https://gerrit.wikimedia.org/r/1046786

Change #1043799 merged by EoghanGaffney:

[operations/puppet@production] lists: Block incoming email on lists hosts during mailman migration

https://gerrit.wikimedia.org/r/1043799

Change #1036610 merged by EoghanGaffney:

[operations/puppet@production] lists: Migrate mailman primary host from lists1001 -> lists1004

https://gerrit.wikimedia.org/r/1036610

Change #1046785 merged by EoghanGaffney:

[operations/puppet@production] lists: Switch DB firewall rules to use primary host variable

https://gerrit.wikimedia.org/r/1046785

Change #1046786 merged by EoghanGaffney:

[operations/puppet@production] lists: Allow mail to be received on lists1004

https://gerrit.wikimedia.org/r/1046786

Change #1047094 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Add symlink to /var/lib/mailman3 when using different root

https://gerrit.wikimedia.org/r/1047094

Change #1047101 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Change lists sync to use quickdatacopy

https://gerrit.wikimedia.org/r/1047101

Change #1047101 merged by EoghanGaffney:

[operations/puppet@production] lists: Change lists sync to use quickdatacopy

https://gerrit.wikimedia.org/r/1047101

Change #1047160 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] lists: fix invalid unit name for rsync::quickdatacopy

https://gerrit.wikimedia.org/r/1047160

Change #1047160 merged by Dzahn:

[operations/puppet@production] lists: fix invalid unit name for rsync::quickdatacopy

https://gerrit.wikimedia.org/r/1047160

Mentioned in SAL (#wikimedia-operations) [2024-06-18T19:17:51Z] <mutante> lists1001 - systemctl reset-failed - clean up systemd state due to units not found anymore after migration - disable puppet and then deploy gerrit:1047160 on lists to fix invalid unit name - T331706

After a small follow-up fix, rsync::quickdatacopy is now in use and copies both from and to the new path /srv/mailman3 (and /var/lib/mailman as before).

lists2001 pulls from lists1004 without issues now and lists1001 has no syncing services.

Change #1047184 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] mailman3: remove buster support

https://gerrit.wikimedia.org/r/1047184

Change #1047925 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Move lists1001 to insetup::buster

https://gerrit.wikimedia.org/r/1047925

Change #1047939 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Switch lists1001 to insetup::buster

https://gerrit.wikimedia.org/r/1047939

Change #1047939 abandoned by EoghanGaffney:

[operations/puppet@production] lists: Switch lists1001 to insetup::buster

Reason:

moritzm beat me to it :D I021e433f6d0ecb1b5eaa26fe69fd09a719854979

https://gerrit.wikimedia.org/r/1047939

Change #1047925 merged by Muehlenhoff:

[operations/puppet@production] Move lists1001 to insetup::buster

https://gerrit.wikimedia.org/r/1047925

Change #1047094 merged by EoghanGaffney:

[operations/puppet@production] lists: Add symlink to /var/lib/mailman3 when using different root

https://gerrit.wikimedia.org/r/1047094

Change #1047184 merged by EoghanGaffney:

[operations/puppet@production] mailman3: remove buster support

https://gerrit.wikimedia.org/r/1047184

The migration to the new host is done. The last remaining item before we can close this ticket is to decommission the old host. We're going to keep that around for two weeks after the migration, which will be Tuesday 2nd July. The host will be shut down on that date, and decommissioned on the Tuesday after.

There is an alert in Icinga that says there are too many runners.

"PROCS CRITICAL: 15 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'"

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=lists1004

It looks like it's configured to alert when it's not exactly 14. Maybe that's just too strict.

nrpe_command => '/usr/lib/nagios/plugins/check_procs -c 14:14 -u list --ereg-argument-array=\'/usr/lib/mailman3/bin/runner\'',

from profile::lists::monitoring
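
For reference, `-c 14:14` makes check_procs go critical whenever the process count falls outside the inclusive range 14 to 14, so 15 runners trips it. A minimal sketch of that range logic, assuming POSIX sh; `in_range` is a hypothetical helper that handles only the plain `min:max` form, and `14:16` is an invented looser range for comparison, not a proposed value:

```shell
#!/bin/sh
# Return 0 if a count lies inside an inclusive "min:max" range.
# Only the plain min:max form is handled, not the full Nagios range syntax.
in_range() {
    min=${2%%:*}
    max=${2##*:}
    [ "$1" -ge "$min" ] && [ "$1" -le "$max" ]
}

# 15 runners against the current range and a hypothetical looser one.
for range in 14:14 14:16; do
    if in_range 15 "$range"; then
        echo "range $range, 15 runners: OK"
    else
        echo "range $range, 15 runners: CRITICAL"
    fi
done
```

Whether to widen the range or pin the expected runner count to the new host's configuration depends on why lists1004 runs 15 runners, which the alert alone does not tell us.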