The Great Clean Up of Mailman2
Open, Needs Triage. Public.

Description

Once T52864: Upgrade GNU Mailman from 2.1 to Mailman3 is done, we need to clean up lots of stuff:

  • Refactor puppet.
  • Disable mailman2.
  • Remove the apache auth system and cut access to /private/ entirely.
  • Remove emails from old archives. (Note: the mm3 upgrade occasionally, for about 0.01% of mails, fails to upgrade a mail. Should we clean those up manually and delete them, or keep them? How do we find them?)
  • Delete all mm2 mail configs (with rmlist?)
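One possible way to approach the "how to find them" question is to compare Message-IDs in an mm2 mbox against the set of IDs that made it into mm3. The sketch below covers only the mbox side, using Python's standard `mailbox` module. How `imported_ids` is obtained from mm3 is not described in this task, so it is passed in as a plain set, and the function name is hypothetical.

```python
import mailbox


def missing_message_ids(mbox_path, imported_ids):
    """Return Message-IDs present in the mm2 mbox but absent from imported_ids.

    imported_ids is assumed to be a set of Message-ID strings exported from
    mm3 by some other means (not covered here).
    """
    missing = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # Messages without a Message-ID header are skipped in this sketch.
            message_id = str(message.get("Message-ID") or "").strip()
            if message_id and message_id not in imported_ids:
                missing.append(message_id)
    finally:
        box.close()
    return missing
```

Mails without a Message-ID header would need a different key (for example a hash of the body), which this sketch does not attempt.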

Event Timeline

@jcrespo before we embark upon this cleanup, can we mark one of the backups of var-lib-mailman to be kept long term? Per https://wikitech.wikimedia.org/wiki/Bacula#Retention it seems the normal backups are only kept for 3 months - could we have one kept for a year or two? We're not actually ready for this yet, I just wanted to check if it was possible.

We cannot mark an existing backup to be kept long term, but we can generate new backups on the archive schedule/pool, which will be retained for 5 years. If it is an old backup, we can recover it and back it up again with the new retention in the archive pool: https://wikitech.wikimedia.org/wiki/Bacula#Configured_Pools

Ack, we haven't deleted anything yet so creating a new backup should work. I'll ping you again once we're ready for that, thanks!

I think we are ready. Assigning as requested in IRC.

Aklapper added a parent task: Restricted Task. (May 24 2021, 5:29 PM)

Change 694210 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mailman2: Generate a 5-year retention Archive backups of mailman

https://gerrit.wikimedia.org/r/694210

Change 694210 merged by Jcrespo:

[operations/puppet@production] mailman2: Generate a 5-year retention Archive backups of mailman

https://gerrit.wikimedia.org/r/694210

Change 694354 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mailman2: Disable temporarily production mailman2 backups

https://gerrit.wikimedia.org/r/694354

Change 694354 merged by Jcrespo:

[operations/puppet@production] mailman2: Disable temporarily production mailman2 backups

https://gerrit.wikimedia.org/r/694354

Ready when you are:

Run Backup job
JobName:  lists1001.wikimedia.org-Weekly-Mon-Archive-var-lib-mailman
Level:    Full
Client:   lists1001.wikimedia.org-fd
FileSet:  var-lib-mailman
Pool:     Archive (From Job resource)
Storage:  backup1001-FileStorageArchive (From Pool resource)
When:     2021-05-25 09:52:03
Priority: 10
OK to run? (yes/mod/no):

This is now scheduled, I will monitor and give a heads up when it finishes.

20 gigabytes backed up so far (1/6th). It is normal for a full backup to take a long time there because of the many small files.

20.83 G lists1001.wikimedia.org-Weekly-Mon-Archive-var-lib-mailman is running

You can track the progress at: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=lists1001.wikimedia.org-Weekly-Mon-Archive-var-lib-mailman

It will likely take until Wednesday morning (GMT) to finish (2 337 957 files / 61.73 G so far).

jcrespo moved this task from Ready to Done on the Data-Persistence-Backup board.

The backup finished, JobId=338470:

Elapsed time:           14 hours 53 mins 5 secs
SD Files Written:       6,117,027
SD Bytes Written:       155,327,853,326 (155.3 GB)
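
As a rough illustration of the "many small files" point, the figures in this job summary work out to an average of about 25 kB per file and a little under 3 MB/s overall. The arithmetic is sketched below; only the three input numbers come from the summary, the variable names are just for illustration.

```python
# Figures copied from the Bacula job summary (JobId=338470).
elapsed_seconds = 14 * 3600 + 53 * 60 + 5   # 14 hours 53 mins 5 secs
files_written = 6_117_027
bytes_written = 155_327_853_326

avg_file_size = bytes_written / files_written        # bytes per file
throughput = bytes_written / elapsed_seconds         # bytes per second
files_per_second = files_written / elapsed_seconds

print(f"average file size: {avg_file_size / 1000:.1f} kB")
print(f"throughput:        {throughput / 1e6:.2f} MB/s")
print(f"files per second:  {files_per_second:.0f}")
```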

Do you want to do a (partial) test recovery before deleting, to prove you can restore an arbitrary subset of files?

Could you give me a meaningful restore operation (a subdirectory)? I am guessing that recovering everything is not wanted because of the time and space available. I can recover it elsewhere (not in place) and then you can compare it with the existing data. E.g. the files for a list you are about to remove?

Something like /var/lib/mailman/archives/private/cloud-announce.mbox/cloud-announce.mbox and /var/lib/mailman/lists/cloud-announce/config.pck would be great.

The recovery has been scheduled as requested. FYI, there were other files inside /var/lib/mailman/lists/cloud-announce/, but they were not marked for recovery.

The files recovered should appear soon (with full path) inside:
/var/tmp/bacula-restores

In a real case I would kill the ongoing backup to force the recovery to start immediately, but as this is a test I will let the ongoing backups finish first; the restore should eventually execute.

It should have run already, can you check?

Termination:            Restore OK
root@lists1001:/var/tmp/bacula-restores/var/lib/mailman/archives/private/cloud-announce.mbox# cmp cloud-announce.mbox /var/lib/mailman/archives/private/cloud.mbox/cloud.mbox 
cloud-announce.mbox /var/lib/mailman/archives/private/cloud.mbox/cloud.mbox differ: byte 6, line 1
root@lists1001:/var/tmp/bacula-restores/var/lib/mailman/archives/private/cloud-announce.mbox# cmp cloud-announce.mbox /var/lib/mailman/archives/private/cloud-announce.mbox/cloud-announce.mbox 
root@lists1001:/var/tmp/bacula-restores/var/lib/mailman/archives/private/cloud-announce.mbox#

Archive looks good but config differs:

root@lists1001:/var/tmp/bacula-restores/var/lib/mailman/lists/cloud-announce# cmp config.pck /var/lib/mailman/lists/cloud-announce/config.pck
config.pck /var/lib/mailman/lists/cloud-announce/config.pck differ: byte 46, line 2

Can't say why
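
For a larger test restore, running cmp file by file gets tedious. A minimal sketch of the same check in Python follows, assuming the restore keeps the full original path under the restore directory, as it did here. The function names are made up for this example, and hashing is just one way to compare.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large mbox files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(restore_root, live_root="/"):
    """Yield (relative_path, status) for every regular file under restore_root."""
    restore_root = Path(restore_root)
    live_root = Path(live_root)
    for restored in sorted(p for p in restore_root.rglob("*") if p.is_file()):
        rel = restored.relative_to(restore_root)
        live = live_root / rel
        if not live.is_file():
            yield rel, "missing on live"
        elif sha256_of(restored) == sha256_of(live):
            yield rel, "identical"
        else:
            yield rel, "differs"
```

On lists1001 this would be called as `compare_restore("/var/tmp/bacula-restores")`. Like cmp, it only says whether files differ, not why.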

The backups ran yesterday; could it have changed since then? Is there a human-readable way to see what changed?

Technically no: we disabled the list a while ago, and HTTP requests to it are now redirected to mm3. It is still possible that the file changed, for example through the periodic cleanup of held messages or something similar; I can't say for sure. I tried cmp -l file1.bin file2.bin | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}' but that didn't give me anything useful, TBH.
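
For reference, cmp -l prints the 1-based offset of each differing byte followed by the two byte values in octal, which is why the gawk filter converts them with strtonum(0...). A rough Python equivalent is sketched below; the function name is hypothetical. It reads both files fully into memory, which is fine for a config.pck but not for a large mbox.

```python
def differing_bytes(path_a, path_b, limit=20):
    """Return (diffs, size_a, size_b).

    diffs holds up to `limit` tuples of (offset, byte_a, byte_b), with
    1-based offsets like cmp -l. Only the common prefix of the two files is
    compared; a size difference is visible from size_a and size_b.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a, data_b = fa.read(), fb.read()
    diffs = []
    for offset, (a, b) in enumerate(zip(data_a, data_b), start=1):
        if a != b:
            diffs.append((offset, a, b))
            if len(diffs) >= limit:
                break
    return diffs, len(data_a), len(data_b)


def format_diffs(diffs):
    """Render diffs as hex, in the same layout as the gawk filter's output."""
    return ["%08X %02X %02X" % row for row in diffs]
```

As with the shell pipeline, raw byte offsets in a pickle are rarely meaningful on their own.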

After some *python2* magic (sys.path.append) and some copying and pasting from Stack Overflow, here's a diff of the two files:
{P16243}

and it looks fine to me. Maybe the byte difference is from internal changes to the serialization format, like set() or dictionary order changing when mailman opened and closed the file?
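
The general effect guessed at here is easy to reproduce: two objects can compare equal while their pickles differ byte for byte. The sketch below shows this in Python 3.7+, where dicts keep insertion order. It is an illustration only, not a claim about what Mailman 2 (Python 2) actually did to this file; the keys are made up and `diff_dicts` is a hypothetical helper.

```python
import pickle


def diff_dicts(old, new):
    """Return {key: (old_value, new_value)} for keys that differ or are missing."""
    missing = object()
    changed = {}
    for key in sorted(set(old) | set(new), key=repr):
        a, b = old.get(key, missing), new.get(key, missing)
        if a != b:
            changed[key] = (
                "<absent>" if a is missing else a,
                "<absent>" if b is missing else b,
            )
    return changed


# Same content, built in a different order (illustrative keys only).
restored = {"real_name": "cloud-announce", "archive": True}
live = {"archive": True, "real_name": "cloud-announce"}

same_content = diff_dicts(restored, live) == {}
same_bytes = pickle.dumps(restored) == pickle.dumps(live)
print("same content:", same_content, "| same bytes:", same_bytes)
```

Comparing the unpickled values, as was done for the paste above, is therefore a more reliable check than cmp on the .pck files.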

Then it's good. Let's clean up 🧹

This helps clarify that it was certainly not some bit-flipping-on-the-wire kind of corruption in our backup system, which would impact all Bacula jobs. Thanks for looking into it. My guess is that some global changes could affect the local config. There are a few criticisms to be made of Bacula, but so far it has been very reliable in terms of storage and retrieval. Thank you.

Change 697631 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] mailman: Absent mm2 script files

https://gerrit.wikimedia.org/r/697631

Change 697632 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] mailman: Drop mm2 scripts

https://gerrit.wikimedia.org/r/697632

Mentioned in SAL (#wikimedia-operations) [2021-06-01T17:23:42Z] <Amir1> starting deletion of mbox files on lists1001 for mailman2, first reading-web-team.mbox, then smallest lists (T282303)

Change 697634 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] mailman: Absent configuration files of mailman2 and make package absent

https://gerrit.wikimedia.org/r/697634

Change 697635 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] mailman: Drop absented files and packages

https://gerrit.wikimedia.org/r/697635

Change 697637 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] backup: Drop mm2 exclude backups

https://gerrit.wikimedia.org/r/697637

Change 697638 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] mailman: Drop cgi in apache and access to private/

https://gerrit.wikimedia.org/r/697638

Change 697631 merged by Legoktm:

[operations/puppet@production] mailman: Absent mm2 script files and their systemd timers

https://gerrit.wikimedia.org/r/697631

Change 697632 merged by Legoktm:

[operations/puppet@production] mailman: Drop mm2 scripts

https://gerrit.wikimedia.org/r/697632

Change 697638 merged by Legoktm:

[operations/puppet@production] mailman: Drop cgi in apache and access to private/

https://gerrit.wikimedia.org/r/697638

Change 697634 merged by Legoktm:

[operations/puppet@production] mailman: Absent configuration files of mailman2 and make package absent

https://gerrit.wikimedia.org/r/697634

Now that the mailman2 package is gone, if we need to unpickle a config file to look at it we'll need to install MM2 in a container locally or something. Not a huge issue, just something to keep in mind. It would've been an issue anyways when we switched to bullseye / a new VM.

Maybe with virtualenv?

for example from the source code but that'll be "fun"
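
Another option that avoids installing MM2 at all might be to unpickle with placeholder classes. The sketch below overrides `pickle.Unpickler.find_class` so that any class that cannot be imported is replaced by an empty stand-in. It assumes the pickled objects keep their state in a plain `__dict__`, it has not been tried against a real config.pck, and the class and function names are made up here. Like any unpickling, it should only be run on files you trust.

```python
import io
import pickle


class StubUnpickler(pickle.Unpickler):
    """Unpickler that substitutes placeholder classes for ones it cannot import."""

    def find_class(self, module, name):
        try:
            return super().find_class(module, name)
        except (ImportError, AttributeError):
            # Fallback: an empty class carrying the original module and name,
            # so instance attributes can still be restored into __dict__.
            return type(name, (object,), {"__module__": module})


def load_with_stubs(data, encoding="latin-1"):
    """Unpickle `data` (bytes), stubbing out classes that are not importable.

    latin-1 is an assumption that lets Python 2 byte strings decode without
    errors; the resulting text may need re-decoding by hand.
    """
    return StubUnpickler(io.BytesIO(data), encoding=encoding).load()
```

Usage would be along the lines of `load_with_stubs(open("config.pck", "rb").read())`, after which the result can be inspected like an ordinary dict.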

Mentioned in SAL (#wikimedia-operations) [2021-06-05T15:21:21Z] <Amir1> delete mbox files of group D and E in mm2 (T282303)

Mentioned in SAL (#wikimedia-operations) [2021-06-05T16:16:11Z] <Amir1> deleting all private archives of mm2. All are inaccessible now (T282303)

Change 698306 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/puppet@production] mailman: Drop lists3 role

https://gerrit.wikimedia.org/r/698306

Mentioned in SAL (#wikimedia-operations) [2021-06-09T02:56:43Z] <Amir1> clean up of the rest of mbox files (except arbcom) (T282303)