
Implement static redirects from pipermail archives to hyperkitty archives
Closed, Resolved, Public

Description

In T256539#6456966 I scraped the public pipermail archives and found about 1 million URLs for which we can set up static redirects to hyperkitty. Redirecting to hyperkitty gives users a better experience, since it's integrated with the new archives, rather than keeping a static set of pipermail archives around that will be missing the new messages.

This can be done after the migration, but might be easier to get done as lists get migrated.

Event Timeline

On IRC dpifke pointed me to https://httpd.apache.org/docs/current/rewrite/rewritemap.html which allows using a dbm database or shelling out to an external program.
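As a rough illustration of the RewriteMap idea, here is a minimal sketch that formats a set of redirects as a plain-text map (one whitespace-separated "key value" pair per line), which Apache can read directly or have converted to dbm with its httxt2dbm tool. The function name and the example paths in the usage note are hypothetical and are not taken from the actual redirect script.

```python
from typing import Dict


def format_rewrite_map(redirects: Dict[str, str]) -> str:
    """Return the contents of a plain-text RewriteMap file.

    Each line is "<old path> <new URL>". Keys and values are separated by
    whitespace, so neither side may contain whitespace itself.
    """
    lines = []
    for old_path, new_url in sorted(redirects.items()):
        if any(ch.isspace() for ch in old_path + new_url):
            raise ValueError(f"whitespace not allowed in map entry: {old_path!r}")
        lines.append(f"{old_path} {new_url}\n")
    return "".join(lines)
```

For example, `format_rewrite_map({"/pipermail/example-l/2021-May/000001.html": "https://lists.example.org/new"})` yields a single line that could be written to a map file and referenced from a `RewriteMap` directive.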

akosiaris triaged this task as Medium priority. Apr 21 2021, 12:34 PM

Change 683108 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/software/pipermail-redirector@master] [WIP] Initial commit

https://gerrit.wikimedia.org/r/683108

Change 685723 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] mailman3: Script to generate pipermail redirects

https://gerrit.wikimedia.org/r/685723

the plan is to serve both mailman2 and mailman3 from the same server [...] FWIW the mailman mailing lists preserved their old pipermail archives and just left a note on top saying that the archives are no longer updated

So the public HTML and txt.gz archives will keep being served indefinitely from the same URLs as before? That will reduce the pain considerably.

I mostly figured out how to redirect public HTML URLs, see T280731: Implement static redirects from pipermail archives to hyperkitty archives. Should get deployed tomorrow or early next week.

I don't think hyperkitty has an equivalent for the txt.gz files, so we can keep them around (by not deleting them) if there's value.

hyperkitty allows you to download gzip files (click on "Download" in https://lists.wikimedia.org/hyperkitty/list/listadmins@lists.wikimedia.org/2021/5/). Thankfully, building a redirect for it would be pretty easy, since it doesn't need any hash.

Oooh! I totally missed those. Note that those are mbox files, not txt files so they have similar purposes but different formats. My understanding (from reading something in the Mailman3 bug tracker) is that MM2/pipermail provided plaintext downloads so people could search/grep/etc. through them. But hyperkitty has built-in search, so that's not a use case anymore. And for archival purposes, mbox files are much better than the txt files. So...should we redirect them? I think even though they're not 100% the same, redirects would be fine for most users.
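A minimal sketch of why the monthly redirect needs no hash: the pipermail file name (for example `2021-May.txt.gz`) already carries the year and month that appear in the hyperkitty month URL quoted above. The function name, the regex, and the choice to target the month page rather than the download itself are assumptions made for illustration.

```python
import re
from typing import Optional

# Pipermail names monthly archives "<year>-<English month name>.txt.gz".
MONTHLY_RE = re.compile(r"^(\d{4})-([A-Za-z]+)\.txt\.gz$")
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def monthly_redirect(listname: str, filename: str,
                     domain: str = "lists.wikimedia.org") -> Optional[str]:
    """Map a pipermail monthly archive file name to a hyperkitty month URL.

    Returns None when the file name does not look like a monthly archive.
    """
    match = MONTHLY_RE.match(filename)
    if match is None:
        return None
    year, month_name = match.groups()
    if month_name not in MONTH_NAMES:
        return None
    month = MONTH_NAMES.index(month_name) + 1  # hyperkitty uses 1-12, unpadded
    return f"https://{domain}/hyperkitty/list/{listname}@{domain}/{year}/{month}/"
```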

Change 685723 merged by Legoktm:

[operations/puppet@production] mailman3: Script to generate pipermail redirects

https://gerrit.wikimedia.org/r/685723

I enabled redirects for mediawiki-debian and mediawiki-distributors. So far these only work for individual messages. New lists will automatically be done by the migration script, and I'll backfill the redirects for already-migrated lists later.
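For readers unfamiliar with how an individual-message redirect target can be derived, here is a hedged sketch. It assumes that hyperkitty identifies a message by the base32-encoded SHA-1 of its Message-ID (without angle brackets) and serves it under a `/message/<hash>/` path; both points are assumptions stated for illustration, and the function names are hypothetical rather than taken from the deployed script.

```python
import base64
import hashlib


def message_id_hash(message_id: str) -> str:
    """Hash a Message-ID the way hyperkitty is assumed to (base32 of SHA-1)."""
    bare = message_id.strip().strip("<>")  # drop surrounding angle brackets
    digest = hashlib.sha1(bare.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")


def message_redirect(listname: str, message_id: str,
                     domain: str = "lists.wikimedia.org") -> str:
    """Build the assumed hyperkitty URL for a single message."""
    return (
        f"https://{domain}/hyperkitty/list/{listname}@{domain}"
        f"/message/{message_id_hash(message_id)}/"
    )
```

Under these assumptions, the only input needed per pipermail page is the list name and the message's Message-ID, which is why extracting that ID reliably matters so much.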

legoktm@lists1001:/var/lib/mailman3/redirects$ sudo pipermail_redirects deutschschweiz --no-rebuild
Going through 2020-August...
Traceback (most recent call last):
  File "/usr/local/sbin/pipermail_redirects", line 141, in <module>
    sys.exit(main())
  File "/usr/local/sbin/pipermail_redirects", line 129, in main
    line = handle_email(listname, email)
  File "/usr/local/sbin/pipermail_redirects", line 76, in handle_email
    message_id = extract_in_reply_to(path)
  File "/usr/local/sbin/pipermail_redirects", line 52, in extract_in_reply_to
    text = path.read_text()
  File "/usr/lib/python3.7/pathlib.py", line 1200, in read_text
    return f.read()
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 118: invalid start byte

Sigh.

I enabled redirects for everything that didn't cause unicode errors, currently 102,189 individual emails. 10% there!
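The traceback above comes from `path.read_text()` decoding the file as UTF-8 and hitting byte 0xfc, which is "ü" in Latin-1. One possible way to get past such files, sketched here under the assumption that only ASCII header values are needed from the page, is to fall back to Latin-1 when UTF-8 decoding fails. The helper names are hypothetical and this is not necessarily how the script was later changed.

```python
from pathlib import Path


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 for legacy-encoded pages."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every possible byte to a character, so this cannot fail.
        return raw.decode("latin-1")


def read_text_lenient(path: Path) -> str:
    """Read a pipermail HTML file without raising on non-UTF-8 bytes."""
    return decode_lenient(path.read_bytes())
```

The fallback may mis-render body text that was written in some other legacy encoding, but the ASCII parts of the page are unaffected by that choice.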

The Message-ID is <4cc603b04121014385a1f4dee@mail.gmail.com> but the pipermail HTML file has it as BAY13-F285475D0DAC045FD9F550AB0A80@phx.gbl (which is actually the In-Reply-To/References).

There are actually 3 messages with that message ID listed:

root@lists1001:/var/lib/mailman/archives/public/wikien-l/2004-December# grep "BAY13-F285475D" *.html | grep LINK
017659.html:   <LINK REL="made" HREF="mailto:wikien-l%40lists.wikimedia.org?Subject=NPOV%20and%20credibility%20%28was%20Re%3A%20%5BWikiEN-l%5D%20Original%20research%29&In-Reply-To=BAY13-F285475D0DAC045FD9F550AB0A80%40phx.gbl">
017673.html:   <LINK REL="made" HREF="mailto:wikien-l%40lists.wikimedia.org?Subject=NPOV%20and%20credibility%20%28was%20Re%3A%20%5BWikiEN-l%5D%20Original%20research%29&In-Reply-To=BAY13-F285475D0DAC045FD9F550AB0A80%40phx.gbl">
017794.html:   <LINK REL="made" HREF="mailto:wikien-l%40lists.wikimedia.org?Subject=NPOV%20and%20credibility%20%28was%20Re%3A%20%5BWikiEN-l%5D%20Original%20research%29&In-Reply-To=BAY13-F285475D0DAC045FD9F550AB0A80%40phx.gbl">

Sigh.
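To make the grep output above concrete, here is a minimal sketch of pulling the In-Reply-To value out of the `<LINK REL="made">` mailto link in a pipermail HTML page. The function name and regex are assumptions for illustration and may differ from what `extract_in_reply_to` in the real script does.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Captures the query string of the mailto: link in the "made" LINK tag.
LINK_RE = re.compile(
    r'<LINK REL="made" HREF="mailto:[^"?]*\?([^"]*)"', re.IGNORECASE
)


def in_reply_to_from_html(html_text: str) -> Optional[str]:
    """Return the percent-decoded In-Reply-To parameter, or None if absent."""
    match = LINK_RE.search(html_text)
    if match is None:
        return None
    query = match.group(1).replace("&amp;", "&")
    for part in query.split("&"):
        key, _, value = part.partition("=")
        if key == "In-Reply-To":
            # unquote() leaves "+" alone, unlike form-style query parsing.
            return unquote(value) or None
    return None
```

Run against any of the three files listed above, this would return the same value, `BAY13-F285475D0DAC045FD9F550AB0A80@phx.gbl`, which is exactly the ambiguity being described.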

It's basically impossible for me to figure out why these December 2004 archives are messed up. Maybe someone broke the archives or there was just a Mailman2 bug. Rebuilding the archives again from the mbox (which isn't broken) would theoretically fix it but that would likely also break all numbering in pipermail and isn't a viable option.

So...my plan is to now exclude duplicate Message-IDs. In theory we could use the hyperkitty API to fuzzy match on sender address for a bit of additional confidence; I might do that sometime later.
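A minimal sketch of the exclusion step, assuming the script has already collected (pipermail path, Message-ID) pairs for a list: any Message-ID claimed by more than one pipermail page is treated as ambiguous and every page claiming it is skipped. The function name and data shape are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def drop_duplicate_message_ids(
    pairs: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep only pages whose extracted Message-ID is unique within the list."""
    pairs = list(pairs)
    counts = Counter(message_id for _, message_id in pairs)
    # A Message-ID seen more than once cannot be mapped to one page safely.
    return [(path, mid) for path, mid in pairs if counts[mid] == 1]
```

Dropping all copies rather than keeping the first is the conservative choice here, since there is no way to tell from the HTML alone which page (if any) really carries that Message-ID.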

There's a secondary concern of auditing. I'd like to delete all the redirected pipermail HTML files, but how do we feel confident that the redirects are right? I did spot checks earlier and obviously missed that this kind of scenario even existed.

Change 690391 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] mailman3: Don't redirect pipermail messages with duplicate Message-IDs

https://gerrit.wikimedia.org/r/690391

OK, per https://bugs.launchpad.net/mailman/+bug/558263 and https://bugs.launchpad.net/mailman/+bug/266377 it seems In-Reply-To is unreliable and possibly wrong for all pre-2.1.10 messages. Might be time to come up with a new strategy... :(

Just noting some missing redirects, per discussion at Template talk:MediaWiki News:
Currently for me, in Template:MediaWiki_News:

That message's HTML archive has no In-Reply-To header set, so it's not possible to automatically determine the hyperkitty link (at least I'm not aware of how to do it!). My previous estimates suggested it would be around 5% of emails that end up in a state like this, and my plan was to just leave them as is and not redirect. If someone wanted to go around and find the hyperkitty links manually (with full-text search it's usually straightforward), I'd be happy to deploy those redirects. I just need a map of pipermail -> hyperkitty links.

Ah, I understand. I think your plan is acceptable - the old links still work, and the content is now available in the new system (and thus searchable!! <3 !!), so it's ok to leave as it is (unless someone volunteers to help do more, as you said). Thanks for the explanation!

Change 690391 merged by Legoktm:

[operations/puppet@production] mailman3: Don't redirect pipermail messages with duplicate Message-IDs

https://gerrit.wikimedia.org/r/690391

Mentioned in SAL (#wikimedia-operations) [2021-06-17T21:49:48Z] <legoktm> regenerating pipermail redirects to skip those with duplicate message-ids (T280731)

Legoktm claimed this task.

We currently have 620,937 redirects in place, which is much lower than my original estimate of 1 million, but I think it's probably the best we can do without some manual matching. If someone is interested in doing that, please let me know and we can figure something out.