Figure out a way to sync old and new mailman
Closed, Resolved, Public

Description

Mailman3 has a script to migrate the old archives to the new system but there's no way to do it the other way around. Let's find a way to keep the old and new structure the same.

Event Timeline

Mail from @Ladsgroup to list admins:

Mailman allows us to upgrade one mailing list at a time, which is good, but we haven't found a way to keep the old and new versions in sync (archives, etc.). One option is to migrate a mailing list and let the archives for the old version stop getting updated. Would that work for you? Feel free to chime in: https://phabricator.wikimedia.org/T256539

As an amateurish list admin, I chime in saying: surely it would. It's probably easier to put a link on the "new" archives saying "For archives prior to Day X, please see <here>" than to keep everything in sync forever?

To be clear, we can import all of the legacy mailman 2 archives into hyperkitty, right?

If so, I have a rough idea on how to set up a redirector from mailman2 URLs to hyperkitty. Each email has a unique Message-Id (though if a message was cross-posted, it might not be globally unique), which we can use to match the URLs. I'm currently scraping the pipermail archives to make a mapping of URL to Message-Id. Then I'd figure out how to parse the hyperkitty archives and set up a redirector.

To be clear, we can import all of the legacy mailman 2 archives into hyperkitty, right?

According to the documentation of mailman it should be possible: https://docs.mailman3.org/en/latest/migration.html#upgrade-strategy

If so, I have a rough idea on how to set up a redirector from mailman2 URLs to hyperkitty. Each email has a unique Message-Id (though if a message was cross-posted, it might not be globally unique), which we can use to match the URLs. I'm currently scraping the pipermail archives to make a mapping of URL to Message-Id. Then I'd figure out how to parse the hyperkitty archives and set up a redirector.

The problem is that URLs for mails in hyperkitty use some hash (I don't know what type of hash tbh). Here's an example https://lists.wmcloud.org/hyperkitty/list/test@lists.wmcloud.org/thread/RMQPKSS4ID3WALFXAF636J2NGBVCN3UA/
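On the hash question: as far as I know, Mailman 3 documents its X-Message-ID-Hash header as the Base32-encoded SHA-1 digest of the Message-Id with the angle brackets removed, and a 32-character Base32 string like the one in the example URL is consistent with that. Whether hyperkitty URLs use exactly this value is an assumption that should be checked against an imported list. A minimal sketch under that assumption (the function name and the sample Message-Id are made up for illustration):

```python
import base64
import hashlib


def message_id_hash(message_id: str) -> str:
    """Return the Base32-encoded SHA-1 digest of a Message-Id.

    Assumption: surrounding whitespace and angle brackets are stripped
    before hashing, as described for Mailman 3's X-Message-ID-Hash header.
    """
    bare = message_id.strip().strip("<>")
    digest = hashlib.sha1(bare.encode("utf-8")).digest()
    # A SHA-1 digest is 20 bytes, which Base32 encodes to 32 characters
    # with no padding.
    return base64.b32encode(digest).decode("ascii")


if __name__ == "__main__":
    # Hypothetical Message-Id, not taken from any real archive.
    print(message_id_hash("<example-123@mail.example.org>"))
```

If this matches what hyperkitty shows for a known message, the redirector would not need to parse the hyperkitty archives at all; it could compute the target from the Message-Id alone.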

Each email has a unique Message-Id (though if a message was cross-posted, it might not be globally unique), which we can use to match the URLs. I'm currently scraping the pipermail archives to make a mapping of URL to Message-Id.

My assumption that every email has a Message-Id might not be true. As of last week, there are 1.02M archived mails in pipermail, and 61k don't have a Message-Id. Some of them look like cases where people have messed with the archives (redacting something, I guess), but for others I couldn't figure out a reason. Maybe the Message-Ids are in the mbox archive? Someone with shell access would need to examine those, I think. I can upload the 123M .csv file I have if people want to take a look.
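For whoever has shell access, checking the mbox side could look roughly like the sketch below. It uses Python's standard `mailbox` module to count how many messages in one mbox file carry a Message-Id header. The function name and the counter keys are made up for illustration, and the location of the mbox files on the server is not something I know.

```python
import mailbox
from collections import Counter


def count_message_ids(mbox_path: str) -> Counter:
    """Count messages with and without a Message-Id header in an mbox file."""
    counts = Counter()
    # create=False: raise an error instead of silently creating an empty mbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # Header lookup is case-insensitive; str() also covers the case
            # where the header comes back as a Header object.
            value = str(msg.get("Message-Id", "")).strip()
            counts["with_id" if value else "without_id"] += 1
    finally:
        box.close()
    return counts
```

Comparing the `without_id` count per list against the scraped numbers would show whether the missing Message-Ids were lost in the pipermail HTML or were never there in the first place.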

(Sidenote: some lists have way too big archives: T262773: Stop archiving the wikidata-bugs mailinglist in pipermail)

To be clear, we can import all of the legacy mailman 2 archives into hyperkitty, right?

According to the documentation of mailman it should be possible: https://docs.mailman3.org/en/latest/migration.html#upgrade-strategy

Can we try importing a list? I'm worried about "The one defect that will definitely cause problems is lines beginning with From in message bodies." which we've been tracking as T115329: "From" at start of line becomes ">From" in pipermail.

The problem is that URLs for mails in hyperkitty use some hash (I don't know what type of hash tbh). Here's an example https://lists.wmcloud.org/hyperkitty/list/test@lists.wmcloud.org/thread/RMQPKSS4ID3WALFXAF636J2NGBVCN3UA/

I think the API https://lists.wmcloud.org/hyperkitty/api/ should be good enough for what I'm trying to do, but it would be nice if we had an imported list to test with.

Coming back to this, I think we should:

  • Get the Internet Archive to scrape all the current pipermail archives/views as a fail-safe (this is currently prevented with https://lists.wikimedia.org/robots.txt)
  • Finish generating the mapping of pipermail URL -> message-id so we can set up a redirector to hyperkitty archives
  • Import the lists into mailman3, see if we're satisfied with the new archives + redirects (need to determine criteria for "satisfied")
    • If not, create a lists2-static.wikimedia.org archive with the final HTML archives and redirect lists.wikimedia/pipermail/* to it.
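The redirector step in the plan above could be sketched as a lookup from pipermail path to Message-Id, followed by building the hyperkitty URL. Everything named here is an assumption for illustration: the base URL, the list domain, the `/message/<hash>/` URL shape (by analogy with the `/thread/<hash>/` example earlier in this task), and the idea that the hash is the Base32-encoded SHA-1 of the Message-Id.

```python
import base64
import hashlib
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Hypothetical base URL; the real deployment may differ.
HYPERKITTY_BASE = "https://lists.example.org/hyperkitty"


def hyperkitty_url(list_address: str, message_id: str,
                   base: str = HYPERKITTY_BASE) -> str:
    """Build a hyperkitty message URL (assumed URL shape and hash scheme)."""
    bare = message_id.strip().strip("<>")
    digest = hashlib.sha1(bare.encode("utf-8")).digest()
    msg_hash = base64.b32encode(digest).decode("ascii")
    return f"{base}/list/{list_address}/message/{msg_hash}/"


def redirect_target(pipermail_url: str,
                    path_to_message_id: Mapping[str, str],
                    list_domain: str = "lists.example.org") -> Optional[str]:
    """Return the hyperkitty URL for a pipermail URL, or None if unmapped.

    path_to_message_id maps pipermail paths such as
    "/pipermail/<list>/<YYYY-Month>/<number>.html" to Message-Ids.
    """
    path = urlsplit(pipermail_url).path
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "pipermail":
        return None
    message_id = path_to_message_id.get(path)
    if message_id is None:
        # No Message-Id known for this mail; the caller needs a fallback,
        # for example the static HTML archive.
        return None
    return hyperkitty_url(f"{parts[1]}@{list_domain}", message_id)
```

Returning `None` for unmapped URLs matters here because a sizeable share of archived mails have no Message-Id, so the redirector needs a fallback rather than a 404.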

There is the archive aspect of the upgrade, but there's also the dual-support aspect, which bothers me a lot and for which I haven't come up with a good solution yet. Imagine we want a migration period during which mailman2 and 3 both run in production and we slowly upgrade one mailing list after another. What should that look like?

  • Should both live in lists.wikimedia.org?
    • That means two roles on mailman1001, with exim handling the routing to the different mailmans (which will be quite fun), plus many other complexities.
  • Should migrated mailing lists live in lists3.wikimedia.org or lists-next.wikimedia.org and then be moved back later?
    • How is that possible? Wouldn't it hurt the archives, since the domain changes after migration?
  • Maybe come up with a cool but different name like "lists-new.wikimedia.org" and after migration leave it like that?
  • Mailman3 can handle different domains than the email domain.
    • So keep the web in lists-next.wikimedia.org (and ultimately change it back) and the mails being sent to lists.wikimedia.org
      • How do we deliver mail from the old one to this one?

python.org and some other FLOSS orgs have done an upgrade. We should take a look at how it was done in those places.

You did skip over the easy option: declare a downtime for X hours, migrate everything over, and then bring it all back up on mailman3. Implementation-wise it's the easiest, but it requires us to be very confident in our testing that stuff won't go wrong and, when it inevitably does, to be ready to fix it immediately. I don't think we're going to reach that level of confidence though.

Note that I haven't thought about how to go about implementing this yet.

There is the archive aspect of the upgrade, but there's also the dual-support aspect, which bothers me a lot and for which I haven't come up with a good solution yet. Imagine we want a migration period during which mailman2 and 3 both run in production and we slowly upgrade one mailing list after another. What should that look like?

In my head the upgrade happens in 3ish stages:

  1. Opt-in test lists where the admins volunteer to go first
  2. Small/medium lists
  3. Large, modern lists
    • 3.5. Large, very old lists (wikipedia-l, wikitech-l, etc. Anything with content before, say, 2010? Not sure where the cutoff should be)

It would be nice if we could do the archive imports ahead of time; that would help us track down issues, especially with the legacy lists. Actually, that's my main worry right now. The exim stuff seems rather straightforward.

  • Should both live in lists.wikimedia.org?
  • During the migration period, sending emails to <name>@lists.wikimedia.org should work, regardless of which mailman is running the list. Ideally users would also get email via lists.wikimedia.org so mail filters, etc. will all continue to work
  • In the end, mailman3 should be on lists.wikimedia.org.
  • That means two roles on mailman1001, with exim handling the routing to the different mailmans (which will be quite fun), plus many other complexities.

I guess we'd need to have a list in puppet/hiera for those that are migrated to route them differently.

  • Should migrated mailing lists live in lists3.wikimedia.org or lists-next.wikimedia.org and then be moved back later?

...if we have to. The URL paths should be different right?

So in theory we could route ^/(pipermail|mailman)/ to mailman2 and ^/(archives|admin)/ to mailman3.

Now that I've typed it out, I don't think it would be too bad if we can figure out the exim config. We would just install the mailman3 stuff on the existing lists1001 VM, have a list of all the non-migrated lists in puppet, pass that list/regex to exim and apache to route mails and web requests accordingly. As we migrate lists, we remove them from the puppet/hiera list and exim/apache will reroute to mailman3 accordingly. (Please, poke holes in this)

As a bonus, we keep the same IP, which means it doesn't lose reputation in all the spam filtering things out there.
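To make the routing idea above concrete, here is a rough sketch of the decision logic in Python. It only models the decision; the real thing would live in the exim and apache configuration. The two regexes are the ones proposed above, while the function names, the return values, and the `non_migrated` set (standing in for the puppet/hiera list) are made up for illustration.

```python
import re
from typing import AbstractSet, Optional

# URL prefixes as proposed above; the mailman3 prefixes are a guess
# until the actual install is looked at.
MAILMAN2_WEB = re.compile(r"^/(pipermail|mailman)/")
MAILMAN3_WEB = re.compile(r"^/(archives|admin)/")


def route_web(path: str) -> Optional[str]:
    """Pick the backend for a web request path, or None if neither matches."""
    if MAILMAN2_WEB.match(path):
        return "mailman2"
    if MAILMAN3_WEB.match(path):
        return "mailman3"
    return None


def route_mail(list_name: str, non_migrated: AbstractSet[str]) -> str:
    """Pick the backend for mail sent to <list_name>@lists.wikimedia.org.

    non_migrated stands in for the list kept in puppet/hiera. Handling of
    suffixed addresses (such as -request or -bounces) is left out of this
    sketch.
    """
    return "mailman2" if list_name.lower() in non_migrated else "mailman3"
```

Migrating a list then amounts to removing its name from `non_migrated`; nothing else in the routing changes.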

  • Maybe come up with a cool but different name like "lists-new.wikimedia.org" and after migration leave it like that?

-1

python.org and some other FLOSS orgs have done an upgrade. We should take a look at how it was done in those places.

Definitely. I know of https://fedoraproject.org/wiki/Mailman3_Migration as well.

You don't need a list of mailing lists in puppet: the exim4 routing checks whether a directory for that specific mailing list exists under mailman3, and if it doesn't, exim doesn't route the mail to mailman3.
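The directory check described above boils down to something like the following sketch. It assumes that mailman3 keeps one directory per list, named `<list>.<domain>`, under a lists directory, which is how I remember the Exim example in the Mailman 3 documentation; the function name and any concrete path are hypothetical and would need to be checked on the actual host.

```python
from pathlib import Path


def served_by_mailman3(local_part: str, domain: str, mm3_lists_dir: str) -> bool:
    """Return True if a mailman3 directory exists for this list.

    Assumption: mailman3 creates <mm3_lists_dir>/<local_part>.<domain>
    for every list it manages, so the directory's existence can drive
    the routing without a separate list in puppet.
    """
    return (Path(mm3_lists_dir) / f"{local_part}.{domain}").is_dir()
```

With a check like this, migrating a list in mailman3 is what flips the routing, so there is no second source of truth to keep in sync.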

I think there's now a clear picture of what to do next. I'm calling this resolved and have already created tickets for installing mailman3 on lists1001.wikimedia.org: T278610: Install mailman3 on lists1001.wikimedia.org