Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | Security | None | T181803 Stop storing Mailman passwords in plain text
Resolved | | None | T118641 Implement proper AAA for lists.wikimedia.org (mailman)
Resolved | | None | T190054 List archives on lists.wikimedia.org is not mobile friendly
Resolved | | None | T115329 "From" at start of line becomes ">From" in pipermail
Resolved | | None | T52864 Upgrade GNU Mailman from 2.1 to Mailman3
Resolved | | Ladsgroup | T278609 Import several public mailing lists archives from mailman2 to lists-next to measure database size
Event Timeline
What about some of the lists that primarily carry automated emails:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs T262773: Stop archiving the wikidata-bugs mailinglist in pipermail
https://lists.wikimedia.org/mailman/listinfo/betacluster-alerts
https://lists.wikimedia.org/mailman/listinfo/discovery-alerts
It's a good idea, but in the case of wikidata-bugs, for example, there is no archive, so it won't add much to the archive tables for us to measure.
Change 675354 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/puppet@production] mailman3: Add rsync for mailman2 archives for importing
Change 675354 merged by Legoktm:
[operations/puppet@production] mailman3: Add rsync for mailman2 archives for importing
rsync is set up; per https://docs.mailman3.org/en/latest/migration.html, we need the list config and the mbox for importing.
legoktm@lists1002:~$ sudo /usr/bin/rsync -a rsync://lists1001.wikimedia.org/var-lib-mailman/lists/discovery-alerts .
legoktm@lists1002:~$ sudo /usr/bin/rsync -a rsync://lists1001.wikimedia.org/var-lib-mailman/archives/private/discovery-alerts.mbox .
The discovery-alerts/config.pck and discovery-alerts.mbox are sitting in my home on lists1002, but I don't have time to try importing it right now. If you have time @Ladsgroup, feel free to give it a shot.
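Per the migration guide, the import itself should then be roughly the following two steps (a sketch, with LIST standing in for the actual list name and paths relative to where the rsync'd files landed):

sudo mailman import21 LIST@lists-next.wikimedia.org LIST/config.pck
sudo django-admin hyperkitty_import -l LIST@lists-next.wikimedia.org LIST.mbox/LIST.mbox --pythonpath /usr/share/mailman3-web --settings settings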
I tried and got this:
ladsgroup@lists1002:/home/legoktm$ sudo mailman import21 discovery-alerts@lists-next.wikimedia.org discovery-alerts/config.pck
Traceback (most recent call last):
  File "/usr/bin/mailman", line 11, in <module>
    load_entry_point('mailman==3.2.1', 'console_scripts', 'mailman')()
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/mailman/bin/mailman.py", line 69, in invoke
    return super().invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/mailman/commands/cli_import.py", line 64, in import21
    pickle_file, encoding='utf-8', errors='ignore')
ModuleNotFoundError: No module named 'Mailman'
I assume it's because mailman2 is not installed on lists1002; and it can't simply be installed there, since this stack is Python 3 while mailman2 is Python 2.
Mentioned in SAL (#wikimedia-operations) [2021-03-30T23:29:56Z] <Amir1> sudo django-admin hyperkitty_import -l discovery-alerts@lists-next.wikimedia.org discovery-alerts.mbox/discovery-alerts.mbox --pythonpath /usr/share/mailman3-web --settings settings (T278609)
@Ladsgroup found https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/JEPMB3HW4FI57EUMOST4L7BD2ILIIS3P/#E3UXFXT27WLRPQIR62KULKFO5SBH5NLT which led to https://gitlab.com/mailman/mailman/commit/2a15437e911660ab87f960ac3a9eba131a2b7350 - another patch we'll need to cherry-pick, sigh.
He was able to import the list archives though: https://lists-next.wikimedia.org/hyperkitty/list/discovery-alerts@lists-next.wikimedia.org/
<Amir1> it took long to migrate 4448 emails
If we need a workaround, we can just drop all bounce information from those mailing lists (also, we should probably fix bouncing too :D). Besides this, there's another point: we should import the archives, but if we import the mailing lists themselves, we might also import the subscribers. We shouldn't do that :D
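If we went that route, stripping the bounce state out of a config.pck before import could look something like this hypothetical, untested sketch (it assumes the only objects from the old Mailman package in the pickle are the _BounceInfo entries under bounce_info):

import pickle

class _Stub:
    """Placeholder for classes from the missing Python 2 'Mailman' package."""

class Mailman2Unpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module.startswith('Mailman'):
            return _Stub  # e.g. Mailman.Bouncer._BounceInfo
        return super().find_class(module, name)

with open('config.pck', 'rb') as f:
    config = Mailman2Unpickler(f, encoding='utf-8', errors='ignore').load()

# Drop per-member bounce state; everything else is left untouched.
config.pop('bounce_info', None)

with open('config-nobounce.pck', 'wb') as f:
    pickle.dump(config, f, protocol=2)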
It cherry-picked cleanly, I'll upgrade lists-next tomorrow.
Oh uh, yeah.... I guess we can immediately put the list in emergency moderation mode to make sure no emails get sent? We really do need to test the import process though, as it clearly was already broken.
There's also https://lists.wikimedia.org/mailman/listinfo/test and https://lists.wikimedia.org/mailman/listinfo/test-second which we can use to make sure the import process doesn't fatal anymore. But I do want to test the import process of a real, human, mailing list.
I totally agree that we should import real mailing lists to make sure it works fine. Putting it on moderation seems fine to me.
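For the record, flipping that switch on the mailman2 side should be a one-liner with config_list (a sketch, assuming the stock Debian paths; untested here):

echo 'emergency = 1' > /tmp/emergency.cfg
sudo /usr/lib/mailman/bin/config_list -i /tmp/emergency.cfg LISTNAME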
Change 676479 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/puppet@production] mailman3: Avoid duplication in rsync definitions
I checked the db size and it seems it's indeed smaller:
- The mbox file for discovery-alerts is 14 MB.
- The largest hyperkitty tables are hyperkitty_email (7.25 MB) and hyperkitty_attachment (1.53 MB), less than 10 MB in total. Note that this covers all mailing lists.
- The search index, though, is 17 MB (you can see it in /var/lib/mailman3/web/fulltext_index). Maybe we should avoid using whoosh and use xapian or something else. We could also avoid indexing all old emails (older than five years? I don't know if that's possible).
legoktm@lists1002:~$ du -hs discovery-alerts.mbox
14M     discovery-alerts.mbox

mysql:testmailman3web@m5-master.eqiad.wmnet [(none)]> SELECT
    -> table_schema 'Database Name',
    -> SUM(data_length + index_length) 'Size in Bytes',
    -> ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) 'Size in MiB'
    -> FROM information_schema.tables
    -> GROUP BY table_schema;
+--------------------+---------------+-------------+
| Database Name      | Size in Bytes | Size in MiB |
+--------------------+---------------+-------------+
| information_schema |        196608 |        0.19 |
| testmailman3web    |      11911168 |       11.36 |
+--------------------+---------------+-------------+
2 rows in set (0.005 sec)
Note that testmailman3web includes our various test lists too. We should import some more lists to see whether this trend of using less disk space continues :)
Change 676479 merged by Legoktm:
[operations/puppet@production] mailman3: Avoid duplication in rsync definitions
legoktm@lists1002:~$ du -hs /var/lib/mailman3/web/*index
16M     /var/lib/mailman3/web/fulltext_index
58M     /var/lib/mailman3/web/fulltext_xapian_index
🤔 not super surprised that xapian will have a bigger index if it's supposed to have better search results. FWIW it took ~45s to do a full reindex with xapian (4483 mails).
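For reference, the whoosh/xapian switch is just the haystack engine in the Django settings, something like this sketch (assuming mailman3-web's settings layout and the xapian-haystack backend):

HAYSTACK_CONNECTIONS = {
    'default': {
        # stock mailman3-web default: 'haystack.backends.whoosh_backend.WhooshEngine'
        'ENGINE': 'xapian_backend.XapianEngine',
        'PATH': '/var/lib/mailman3/web/fulltext_xapian_index',
    },
}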
qa-alerts and wikidata-bugs seem like good candidates to try next, though I would want DBA supervision if we're going to import 1.8G worth of data at once...I'm guessing mailman has no replag checking/delays.
Also, one other consideration is attachments, which I think make up most of reading-web-team. We could store those on disk if we wanted to, but I think we'd prefer to put them in the database so we have no strong filesystem dependencies (aside from the search index).
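If we did want them on disk, HyperKitty has a documented setting for that (attachments stay in the hyperkitty_attachment table when it's unset); a sketch, with a hypothetical path:

HYPERKITTY_ATTACHMENT_FOLDER = '/var/lib/mailman3/web/attachments'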
I think we should just drop the archives of wikidata-bugs (T262773: Stop archiving the wikidata-bugs mailinglist in pipermail). It's tiny on the filesystem, but migrating it and the other work around it is a big mess.
One silver lining would be that lots of large mailing lists are large because they have lots of massive attachments and hopefully those won't get indexed.
Okay, I made a "discovery" which I wish I hadn't. It's not fully clear to me yet, but I'm sure that with the migration to mailman3 we will save a lot of space.
First:
- Mailman2 currently uses 175GB. That's a lot.
Now:
- For public mailing lists, you have the public archives:
list@lists1001$ du -hs /var/lib/mailman/archives/public/wikidata-bugs/
4.0G    /var/lib/mailman/archives/public/wikidata-bugs/
- You also have its private archives, exactly the same:
list@lists1001$ du -hs /var/lib/mailman/archives/private/wikidata-bugs/
4.0G    /var/lib/mailman/archives/private/wikidata-bugs
Nope, they are not symlinks:
list@lists1001$ file /var/lib/mailman/archives/private/wikidata-bugs/2019-September.txt.gz
/var/lib/mailman/archives/private/wikidata-bugs/2019-September.txt.gz: gzip compressed data, was "2019-September.txt", last modified: Tue Oct 1 03:27:04 2019, max compression, original size 3583331
list@lists1001$ file /var/lib/mailman/archives/public/wikidata-bugs/2019-September.txt.gz
/var/lib/mailman/archives/public/wikidata-bugs/2019-September.txt.gz: gzip compressed data, was "2019-September.txt", last modified: Tue Oct 1 03:27:04 2019, max compression, original size 3583331
We have them repeated in the form of an mbox as well:
list@lists1001$ du -hs /var/lib/mailman/archives/private/wikidata-bugs.mbox
1.3G    /var/lib/mailman/archives/private/wikidata-bugs.mbox
Besides the mbox being repeated, each month also exists separately as a text version and a gzipped version:
list@lists1001:/var/lib/mailman/archives/public/wikidata-bugs$ ls | head
2012-April
2012-April.txt
2012-April.txt.gz
2012-August
2012-August.txt
2012-August.txt.gz
2012-December
2012-December.txt
2012-December.txt.gz
2012-July
It gets better: most modern clients send the HTML version of the email as an attachment, meaning the email also gets repeated in the form of an HTML attachment.
list@lists1001$ du -hs /var/lib/mailman/archives/private/wikidata-bugs/attachments/
2.1G    /var/lib/mailman/archives/private/wikidata-bugs/attachments/
So if my calculations are correct, for each public mailing list with archives enabled, mailman2 keeps nine copies of each email: one mbox, two gzip, two html, two txt, two raw.
One point: if we want to add xapian indexing, it might cause storage issues on lists1001 during the upgrade, as lists1001 is already 65% full. Someone told me we can simply add more storage there, but I don't remember how.
I imported pywikibot-bugs. Its mbox is 290MB and has 50,058 emails.
Importing the subscribers etc. was easy and done in under a minute:
ladsgroup@lists1002:~$ time sudo mailman import21 pywikibot-bugs@lists-next.wikimedia.org pywikibot-bugs/config.pck

real    0m22.505s
user    0m20.314s
sys     0m0.307s
Importing the archives took half an hour and produced some warnings.
ladsgroup@lists1002:~$ time sudo django-admin hyperkitty_import -l pywikibot-bugs@lists-next.wikimedia.org pywikibot-bugs.mbox/pywikibot-bugs.mbox --pythonpath /usr/share/mailman3-web --settings settings
Importing from mbox file pywikibot-bugs.mbox/pywikibot-bugs.mbox to pywikibot-bugs@lists-next.wikimedia.org
5%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x90\\x8C\\xB0\\xF0\\x90...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
7%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x90\\x8C\\xB8\\xF0\\x90...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
31%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x95 i...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
31%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x84\\x0A\\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
33%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x89\\x0A\\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
80%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x88 :...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
81%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x83\\x0A\\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
84%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x99\\x84)....' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
85%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\xA9\\xB5\\x8F&p...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
85%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, 'Incorrect string value: \'\\xF0\\xA9\\xB5\\x8F",...\' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1')
  return self.cursor.execute(query, args)
88%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x8E\\x89 )...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
92%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x90\\x8D\\x86\\xF0\\x90...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
94%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\xA4\\xA8)\\xE2...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
95%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x89\\x0A ...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
98%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x86 \\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
Computing thread structure
Synchronizing properties with Mailman
42 emails left to refresh, checked 0
Warming up cache
The full-text search index is not updated for this list. It will not be updated by the 'minutely' incremental update job. To update the index for this list, run the 'manage.py update_index_one_list pywikibot-bugs@lists-next.wikimedia.org' command.

real    31m1.486s
user    17m12.923s
sys     1m49.000s
Indexing them took ten minutes:
ladsgroup@lists1002:~$ time sudo django-admin update_index_one_list pywikibot-bugs@lists-next.wikimedia.org --pythonpath /usr/share/mailman3-web --settings settings
Indexing 50058 emails

real    9m53.242s
user    5m48.979s
sys     0m37.031s
The index basically exploded: it's now 790MB, where it used to be 60MB. But this mailing list is really text-heavy; hopefully large mailing lists that are mostly big because of large attachments won't blow up the index the same way.
OTOH, the database is pretty nice: it's only 180MB. Maybe the DBAs can confirm the .ibd sizes on disk?
MariaDB [testmailman3web]> SELECT table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` FROM information_schema.TABLES order by `Size in MB` DESC limit 5;
+----------------------------+------------+
| Table                      | Size in MB |
+----------------------------+------------+
| hyperkitty_email           |     113.59 |
| hyperkitty_attachment      |      56.02 |
| hyperkitty_thread          |       6.89 |
| auth_user_user_permissions |       0.06 |
| auth_user_groups           |       0.06 |
+----------------------------+------------+
5 rows in set (0.009 sec)
Our mboxes are 45GB in total; if a 300MB mbox turns into a 180MB db (and a 600MB search index), I estimate we will end up with a 27GB database and a 90GB search index (worst-case scenario). Maybe we can tune xapian a bit?
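For the record, the back-of-the-envelope math behind those numbers:

python3 -c "s = 45000/300; print(s*180/1000, 'GB db;', s*600/1000, 'GB index')"
27.0 GB db; 90.0 GB index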
m5 went weeee during the import but no alerts were triggered https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1128&var-port=9104&var-dc=eqiad%20prometheus%2Fops&from=1617751232456&to=1617755026685
So I attempted to import wikitech-l, and it basically failed the moment I started it. Running cleanarch on the mbox fixed lots of issues, and the importer then managed to import a couple of years of archives https://lists-next.wikimedia.org/hyperkitty/list/wikitech-l@lists-next.wikimedia.org/thread/G4C7ZDU2G3ROJECRJESAYST7QD7SROCF/ until it failed around 2004/2005. The --since option basically has no effect, since it fails while processing the mbox itself. It also doesn't look like a ">From" issue; it's an encoding issue, and trying to find and delete the offending line also failed miserably.
legoktm@lists1002:/home/ladsgroup$ ~/check_hk_import wikitech-l.mbox
Failed to convert message number 14531, message=id 200501042046.00692.yann@forget-me.net to bytes
'ascii' codec can't encode character '\xa0' in position 143: ordinal not in range(128)
Processed 94380 messages.
That's https://lists.wikimedia.org/pipermail/wikitech-l/2005-January/014563.html. I suppose we could manually edit the mbox to switch that mail from ISO-8859 to UTF-8? (tbh I am shocked that it only reported ONE failing message)
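One minimal way to make that edit (a sketch, assuming GNU sed; the exact line numbers are visible in the diff below, and the two garbled non-ASCII lines still need fixing by hand in an editor):

cp wikitech-l.mbox wikitech-l.mbox.backup-lego
sed -i '1070508s/charset="iso-8859-15"/charset="utf-8"/' wikitech-l.mbox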
legoktm@lists1002:/home/ladsgroup$ diff wikitech-l.mbox.backup-lego wikitech-l.mbox
1070508c1070508
< charset="iso-8859-15"
---
> charset="utf-8"
1070530c1070530
< Wikinews, Wikispecies and Wikimedia Commons logos and URLs�?
---
> Wikinews, Wikispecies and Wikimedia Commons logos and URLs?
1070541c1070541
< http://fr.wikipedia.org/ | Encyclop�die libre
---
> http://fr.wikipedia.org/ | Encyclopedie libre
And with that it now passes check_hk_import, and theoretically @Ladsgroup you should be able to resume the import.
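(check_hk_import is a local helper script; a hypothetical reconstruction of what it does: walk the mbox and flag any message that can't be re-serialized to bytes, which is roughly the step where hyperkitty_import chokes.)

#!/usr/bin/env python3
# Hypothetical sketch of a check_hk_import-style validator.
import mailbox
import sys

count = 0
for i, msg in enumerate(mailbox.mbox(sys.argv[1])):
    count += 1
    try:
        msg.as_bytes()  # the importer needs the raw bytes of each mail
    except Exception as err:
        print(f"Failed to convert message number {i}, "
              f"message-id {msg.get('Message-ID')} to bytes\n{err}")
print(f"Processed {count} messages.")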
Done:
ladsgroup@lists1002:~$ time sudo django-admin update_index_one_list wikitech-l@lists-next.wikimedia.org --pythonpath /usr/share/mailman3-web --settings settings
Indexing 94200 emails

real    33m4.736s
user    14m37.062s
sys     1m34.071s
Total index size:
3.0G fulltext_xapian_index
mbox:
ladsgroup@lists1002:~$ ls -Ssh wikitech-l.mbox
346M    wikitech-l.mbox
Db tables:
MariaDB [testmailman3web]> SELECT table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` FROM information_schema.TABLES order by `Size in MB` DESC limit 5;
+-------------------------------+------------+
| Table                         | Size in MB |
+-------------------------------+------------+
| hyperkitty_email              |     377.03 |
| hyperkitty_attachment         |      71.09 |
| hyperkitty_thread             |      14.39 |
| hyperkitty_sender             |       0.52 |
| socialaccount_socialapp_sites |       0.06 |
+-------------------------------+------------+
5 rows in set (0.009 sec)
Let's go with 500MB in total (if you can get a correct number from the files, that'd be amazing). So 650MB of mbox went to 500MB of database (and a 3GB search index), meaning the full migration would be a 34.6GB database and a 212GB search index (worst-case scenario).
Let's call it resolved. Wikitech-l is one of our oldest and biggest mailing lists, and if we managed to do that one, we can probably do the rest.