Migrate Mailman/lists to Bullseye/Bookworm
Open, Medium, Public

Description

lists1001 is still on Buster. Many of the components comprising the Mailman setup are actually as recent as Bullseye (or even more recent/patched), so these need a closer look if we carry local patches etc. But in general from the Mailman perspective we're already quite close to Bullseye:

Package             Version on lists1001   Version in Bullseye
django-mailman3     1.3.5-2~bpo10+1        1.3.5-2
mailman-hyperkitty  1.1.0-10~bpo10+1       1.1.0-10
mailman-suite       0+20200530-2~bpo10+1   0+20200530-2
mailman3            3.3.3-1~bpo10+6        3.3.3-1
mailmanclient       3.3.2-1~bpo10+2        3.3.2-1
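
The relationship between the two version columns can be checked mechanically. The following is a minimal sketch, not part of the actual tooling: the helper names (strip_backport_suffix, matches_bullseye) are hypothetical, and it assumes the only difference of interest is the trailing Debian backport suffix of the form ~bpoN+M. It compares version strings only, so it cannot tell whether a package carries local patches.

```python
import re

# Versions copied from the table above:
# (package, version on lists1001, version in Bullseye).
PACKAGES = [
    ("django-mailman3", "1.3.5-2~bpo10+1", "1.3.5-2"),
    ("mailman-hyperkitty", "1.1.0-10~bpo10+1", "1.1.0-10"),
    ("mailman-suite", "0+20200530-2~bpo10+1", "0+20200530-2"),
    ("mailman3", "3.3.3-1~bpo10+6", "3.3.3-1"),
    ("mailmanclient", "3.3.2-1~bpo10+2", "3.3.2-1"),
]

# Hypothetical pattern: a trailing backport suffix such as "~bpo10+6".
BACKPORT_SUFFIX = re.compile(r"~bpo\d+\+\d+$")


def strip_backport_suffix(version: str) -> str:
    """Return the version with any trailing ~bpoN+M suffix removed."""
    return BACKPORT_SUFFIX.sub("", version)


def matches_bullseye(installed: str, bullseye: str) -> bool:
    """True when the installed backport shares its base version with Bullseye."""
    return strip_backport_suffix(installed) == bullseye


if __name__ == "__main__":
    for name, installed, bullseye in PACKAGES:
        same = matches_bullseye(installed, bullseye)
        status = "same base version" if same else "differs"
        print(f"{name}: {installed} vs {bullseye}: {status}")
```

Because "~" sorts before everything else in Debian version ordering, each backport version sorts lower than its Bullseye counterpart, so a dist-upgrade would replace the installed package with the stock Bullseye build.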

There are various older considerations on use of public IPs covered at (https://phabricator.wikimedia.org/T278495), but it's probably useful to first upgrade lists1001 in place before moving to a new setup.

Event Timeline

The packages were initially backports of the bullseye versions, but we have a bunch of random patches on top. On T286217#8572913 I wrote/suggested:

Our current Mailman deployment is a bunch of backported and forked debs with random patches thrown on top based on what we managed to fix upstream. It's not sustainable (as hopefully T286217#7406437 shows). Given that we need to get off buster anyways, I would suggest that we wait until the bookworm freeze gets more frozen and set up some lists1003 with normal Debian packages, and after some level of testing switch lists.wm.o over to the new host. The new version will have a new set of bugs, we either learn to live with them or patch via puppet.

MoritzMuehlenhoff renamed this task from Migrate Mailman/lists to Bullseye to Migrate Mailman/lists to Bullseye/Bookworm.Mar 15 2023, 9:29 AM

That's a great idea. Work on getting the bookworm installer is ongoing ATM (T330495)

Change 902182 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] lists: new server to test bookworm functionality

https://gerrit.wikimedia.org/r/902182

Change 902182 merged by JHathaway:

[operations/puppet@production] lists: new server to test bookworm functionality

https://gerrit.wikimedia.org/r/902182

Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host lists1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host lists1003.wikimedia.org with OS bullseye completed:

  • lists1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231429_jhathaway_3111141_lists1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Change 902472 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] lists: Change role of lists1003

https://gerrit.wikimedia.org/r/902472

Change 902472 merged by JHathaway:

[operations/puppet@production] lists: Change role of lists1003

https://gerrit.wikimedia.org/r/902472

Change 902479 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] bookworm: use default mtail pkg

https://gerrit.wikimedia.org/r/902479

Change 902481 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] lists: allow lists1003 to grab a cert

https://gerrit.wikimedia.org/r/902481

Change 902481 merged by JHathaway:

[operations/puppet@production] lists: allow lists1003 to grab a cert

https://gerrit.wikimedia.org/r/902481

Change 902496 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] bookworm: Update spamassassin daemon name

https://gerrit.wikimedia.org/r/902496

Change 902501 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] apache2: Use systemd provider

https://gerrit.wikimedia.org/r/902501

Change 902479 merged by JHathaway:

[operations/puppet@production] bookworm: use default mtail pkg

https://gerrit.wikimedia.org/r/902479

Change 902496 merged by JHathaway:

[operations/puppet@production] bookworm: Update spamassassin daemon name

https://gerrit.wikimedia.org/r/902496

Change 902501 merged by JHathaway:

[operations/puppet@production] apache2: Use systemd provider

https://gerrit.wikimedia.org/r/902501

Change 902782 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] mtail: Update defaults for bookworm

https://gerrit.wikimedia.org/r/902782

Change 902782 merged by JHathaway:

[operations/puppet@production] mtail: Update defaults for bookworm

https://gerrit.wikimedia.org/r/902782

@Legoktm and @Ladsgroup I have set up a new host, lists1003.wikimedia.org, on bookworm. All the software is installed and most of the bookworm issues have been sorted out. However, mailman3 is not starting, since it does not yet have any db grants. My thought was to grant read-only rights and see if that is sufficient?

Change 902808 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] Add an in place Debian upgrade script

https://gerrit.wikimedia.org/r/902808

I'll try to take a look at the grants (it's a bit unusual for me given that it's behind haproxy), but while we are at it, can you increase the number of CPUs? Two CPUs is far too few, and that has bitten us multiple times.

Thanks @Ladsgroup, happy to increase the CPU count. Any sense of what a good number would be?

I'd say let's double to four (current prod VMs have two) and we can easily increase further as needed.

Yeah, I was about to say that from the application point of view, the more the better, like why not 400? But I don't know the limitations of the infra, so I can't say where to stop. We should probably move it to bare metal eventually, but before that someone needs to actually take ownership of it.

I bumped the CPU count to four and as @MoritzMuehlenhoff mentioned we can always bump higher if the need arises.

Change 902808 merged by JHathaway:

[operations/puppet@production] Add an in place Debian upgrade script

https://gerrit.wikimedia.org/r/902808

Change 910598 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Add lists1003 grants for mailman dbs

https://gerrit.wikimedia.org/r/910598

Change 911847 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] httpd: always use systemd

https://gerrit.wikimedia.org/r/911847

Change 911847 merged by Jbond:

[operations/puppet@production] httpd: always use systemd

https://gerrit.wikimedia.org/r/911847

Dzahn subscribed.

T336555 has been opened about alerts related to lists1003. Seems like expected though since this is still WIP.

Change 927684 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] lists: Use stock mailman3 on bookworm

https://gerrit.wikimedia.org/r/927684

Change 927684 merged by JHathaway:

[operations/puppet@production] lists: Use stock mailman3 on bookworm

https://gerrit.wikimedia.org/r/927684

Change 910598 abandoned by Ladsgroup:

[operations/puppet@production] mariadb: Add lists1003 grants for mailman dbs

Reason:

https://gerrit.wikimedia.org/r/910598

Updating the host ownership in the Puppet role should also be part of this task.

Change #1024655 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] mailman: Take ownership of lists hosts

https://gerrit.wikimedia.org/r/1024655

eoghan updated Other Assignee, added: Arnoldokoth.
eoghan added a subscriber: jhathaway.

Change #1024655 merged by EoghanGaffney:

[operations/puppet@production] mailman: Change ownership of lists hosts to sre-collab and rename

https://gerrit.wikimedia.org/r/1024655

Change #1025741 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] WIP: lists: Add lists role and public IPs to list2001

https://gerrit.wikimedia.org/r/1025741

Change #1026157 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Add collaboration services as owner

https://gerrit.wikimedia.org/r/1026157

Change #1026157 merged by EoghanGaffney:

[operations/puppet@production] lists: Add collaboration services as owner

https://gerrit.wikimedia.org/r/1026157

Change #1025741 merged by EoghanGaffney:

[operations/puppet@production] lists: Add lists role to list2001

https://gerrit.wikimedia.org/r/1025741

Change #1035777 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Don't try to remove the mtail user when monitoring is absent

https://gerrit.wikimedia.org/r/1035777

Change #1035777 merged by EoghanGaffney:

[operations/puppet@production] lists: Don't try to remove the mtail user when monitoring is absent

https://gerrit.wikimedia.org/r/1035777

Change #1035785 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Add lists2001/lists1004 as allowed hosts for acmechief

https://gerrit.wikimedia.org/r/1035785

Change #1035785 merged by EoghanGaffney:

[operations/puppet@production] lists: Add lists2001/lists1004 as allowed hosts for acmechief

https://gerrit.wikimedia.org/r/1035785

Change #1036610 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Migrate mailman VIPs from lists1001 -> lists1004

https://gerrit.wikimedia.org/r/1036610

Change #1036686 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Update the quickdatacopy to use /var/lib/mailman3

https://gerrit.wikimedia.org/r/1036686