Import several public mailing lists archives from mailman2 to lists-next to measure database size
Closed, Resolved, Public

Event Timeline

It's a good idea, but in the case of wikidata-bugs, for example, there is no archive, so it won't add much to the archive tables for us to measure.

Legoktm renamed this task from "Migrate several public mailing lists of the old mailman to lists-next to measure database size" to "Import several public mailing lists archives from mailman2 to lists-next to measure database size". Mar 29 2021, 5:48 AM

Change 675354 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/puppet@production] mailman3: Add rsync for mailman2 archives for importing

https://gerrit.wikimedia.org/r/675354

jijiki triaged this task as Medium priority. Mar 29 2021, 9:14 PM

Change 675354 merged by Legoktm:

[operations/puppet@production] mailman3: Add rsync for mailman2 archives for importing

https://gerrit.wikimedia.org/r/675354

rsync is set up. Per https://docs.mailman3.org/en/latest/migration.html, we need the list config and the mbox for importing.

legoktm@lists1002:~$ sudo /usr/bin/rsync  -a  rsync://lists1001.wikimedia.org/var-lib-mailman/lists/discovery-alerts .
legoktm@lists1002:~$ sudo /usr/bin/rsync  -a  rsync://lists1001.wikimedia.org/var-lib-mailman/archives/private/discovery-alerts.mbox .

The discovery-alerts/config.pck and discovery-alerts.mbox are sitting in my home on lists1002, but I don't have time to try importing it right now. If you have time @Ladsgroup, feel free to give it a shot.
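For repeating the fetch step with other lists, here is a minimal sketch wrapping the two rsync invocations above (fetch_list is a hypothetical helper; the rsync module paths are the same ones just used):

import subprocess

def fetch_list(listname):
    # Pull the list config (config.pck) and the raw archive mbox from
    # lists1001, mirroring the two rsync commands above.
    base = 'rsync://lists1001.wikimedia.org/var-lib-mailman'
    for src in (f'{base}/lists/{listname}',
                f'{base}/archives/private/{listname}.mbox'):
        subprocess.run(['/usr/bin/rsync', '-a', src, '.'], check=True)

fetch_list('discovery-alerts')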

I tried and got this:

ladsgroup@lists1002:/home/legoktm$ sudo mailman import21 discovery-alerts@lists-next.wikimedia.org discovery-alerts/config.pck
Traceback (most recent call last):
  File "/usr/bin/mailman", line 11, in <module>
    load_entry_point('mailman==3.2.1', 'console_scripts', 'mailman')()
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/mailman/bin/mailman.py", line 69, in invoke
    return super().invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/mailman/commands/cli_import.py", line 64, in import21
    pickle_file, encoding='utf-8', errors='ignore')
ModuleNotFoundError: No module named 'Mailman'

I assume it's because mailman2 is not installed on lists1002 (unpickling config.pck needs the Mailman 2 Python package), and it couldn't be installed anyway: this stack is Python 3 and mailman2 is Python 2.
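To illustrate where the error comes from: config.pck is a Python 2 pickle whose entries (notably the bounce state) reference classes from the Mailman 2 package, and unpickling has to import them. A sketch of how one could still peek inside by stubbing those classes out (StubUnpickler is my own name, not part of mailman):

import pickle

class StubUnpickler(pickle.Unpickler):
    # Replace classes from the missing Mailman 2 package with empty
    # placeholders so the rest of the pickled dict can still be loaded.
    def find_class(self, module, name):
        if module == 'Mailman' or module.startswith('Mailman.'):
            return type(name, (), {})
        return super().find_class(module, name)

with open('discovery-alerts/config.pck', 'rb') as f:
    config = StubUnpickler(f, encoding='utf-8', errors='ignore').load()
print(sorted(config)[:10])  # peek at a few of the list-setting keys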

Mentioned in SAL (#wikimedia-operations) [2021-03-30T23:29:56Z] <Amir1> sudo django-admin hyperkitty_import -l discovery-alerts@lists-next.wikimedia.org discovery-alerts.mbox/discovery-alerts.mbox --pythonpath /usr/share/mailman3-web --settings settings (T278609)

If we need a workaround, we can just drop all bounce information from those mailing lists (also, we should probably fix bouncing too :D). Besides this, there's another point: we should import archives, but if we import the mailing lists themselves, we might also import the subscribers. We shouldn't do that :D
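A rough sketch of that bounce-dropping workaround, reusing StubUnpickler from the sketch above ('bounce_info' is my guess at the key holding the bounce state in the pickled dict):

import pickle

with open('discovery-alerts/config.pck', 'rb') as f:
    config = StubUnpickler(f, encoding='utf-8', errors='ignore').load()
config.pop('bounce_info', None)         # drop the state that needs Mailman 2 classes
with open('discovery-alerts/config.pck', 'wb') as f:
    pickle.dump(config, f, protocol=2)  # protocol 2 remains readable from Python 2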

It cherry-picked cleanly, I'll upgrade lists-next tomorrow.

Oh uh, yeah.... I guess we can immediately put the list in emergency moderation mode to make sure no emails get sent? We really do need to test the import process though, as it clearly was already broken.

There's also https://lists.wikimedia.org/mailman/listinfo/test and https://lists.wikimedia.org/mailman/listinfo/test-second, which we can use to make sure the import process doesn't fatal anymore. But I do want to test the import process on a real, human mailing list.

I totally agree that we should import real mailing lists to make sure it works fine. Putting it on moderation seems fine to me.

Change 676479 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] mailman3: Avoid duplication in rsync definitions

https://gerrit.wikimedia.org/r/676479

I checked the DB size and it is indeed smaller:

  • The mbox file for discovery-alerts is 14 MB.
  • The largest HyperKitty tables are hyperkitty_email (7.25 MB) and hyperkitty_attachment (1.53 MB), in total less than 10 MB. Note that this covers all of the mailing lists.
  • The search index, though, is 17 MB (you can see it in /var/lib/mailman3/web/fulltext_index). Maybe we should avoid using Whoosh and use Xapian or something else. We could also avoid indexing old emails (older than five years? I don't know if that's possible).
legoktm@lists1002:~$ du -hs discovery-alerts.mbox
14M	discovery-alerts.mbox
mysql:testmailman3web@m5-master.eqiad.wmnet [(none)]> SELECT 
    ->     table_schema 'Database Name',
    ->     SUM(data_length + index_length) 'Size in Bytes',
    ->     ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) 'Size in MiB'
    -> FROM information_schema.tables 
    -> GROUP BY table_schema;
+--------------------+---------------+-------------+
| Database Name      | Size in Bytes | Size in MiB |
+--------------------+---------------+-------------+
| information_schema |        196608 |        0.19 |
| testmailman3web    |      11911168 |       11.36 |
+--------------------+---------------+-------------+
2 rows in set (0.005 sec)

Note that testmailman3web includes our various test lists as well. We should import some more lists to see if this trend of using less disk space continues :)

Change 676479 merged by Legoktm:

[operations/puppet@production] mailman3: Avoid duplication in rsync definitions

https://gerrit.wikimedia.org/r/676479

  • The search index, though, is 17 MB (you can see it in /var/lib/mailman3/web/fulltext_index). Maybe we should avoid using Whoosh and use Xapian or something else. We could also avoid indexing old emails (older than five years? I don't know if that's possible).
legoktm@lists1002:~$ du -hs /var/lib/mailman3/web/*index
16M	/var/lib/mailman3/web/fulltext_index
58M	/var/lib/mailman3/web/fulltext_xapian_index

🤔 not super surprised that xapian will have a bigger index if it's supposed to have better search results. FWIW it took ~45s to do a full reindex with xapian (4483 mails).
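For reference, the Whoosh/Xapian choice lives in mailman3-web's Django settings via django-haystack. Switching engines is roughly the following fragment (a sketch, assuming the xapian-haystack package is installed; the PATH values match the index directories above):

HAYSTACK_CONNECTIONS = {
    'default': {
        # current backend:
        # 'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        # 'PATH': '/var/lib/mailman3/web/fulltext_index',
        'ENGINE': 'xapian_backend.XapianEngine',
        'PATH': '/var/lib/mailman3/web/fulltext_xapian_index',
    },
}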

root@lists1001:/var/lib/mailman/archives/private# du -hs *.mbox | sort -hr | head -n 20
6.5G	reading-web-team.mbox
3.1G	wmfall.mbox
2.5G	arbcom-l.mbox
1.8G	wmfsf.mbox
1.3G	wikidata-bugs.mbox
1.1G	wikinews-pl.mbox
1.1G	wikimedia-in-exec.mbox
890M	ops.mbox
782M	wikimedia-il.mbox
741M	wikimediail-board.mbox
618M	affcom.mbox
597M	blog-admin.mbox
582M	traffic-anomaly-report.mbox
581M	movecom.mbox
563M	wmfcc-l.mbox
552M	reading-wmf.mbox
539M	analytics-internal.mbox
519M	functionaries-en.mbox
492M	qa-alerts.mbox
453M	wikimedia-l.mbox

qa-alerts and wikidata-bugs seem like good candidates to try next, though I would want DBA supervision if we're going to import 1.8G worth of data at once... I'm guessing mailman has no replag checking/delays.
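Something like the following guard, run between import batches, would approximate the replag check mailman lacks (a sketch only: the host and credentials are placeholders, and it assumes a user that may run SHOW SLAVE STATUS on an m5 replica):

import time
import pymysql

def wait_for_replica(host, max_lag=5):
    # Block until the replica is within max_lag seconds of the master.
    conn = pymysql.connect(host=host, user='monitor', password='...',
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        while True:
            with conn.cursor() as cur:
                cur.execute('SHOW SLAVE STATUS')
                row = cur.fetchone()
            lag = row.get('Seconds_Behind_Master') if row else None
            if lag is not None and int(lag) <= max_lag:
                return
            time.sleep(max_lag)
    finally:
        conn.close()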

Also, one other consideration is attachments, which I think make up most of reading-web-team. We could store those on disk if we wanted to, but I think we'd prefer to put them in the database so that we have no strong filesystem dependencies (aside from the search index).
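(If we ever reversed that preference, recent HyperKitty releases do have a setting for keeping attachments on the filesystem instead of in the database; a one-line sketch with a placeholder path:)

# In mailman3-web's settings.py; hedged, as this hasn't been tested here.
HYPERKITTY_ATTACHMENT_FOLDER = '/var/lib/mailman3/web/attachments'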

I think we should drop the archives of wikidata-bugs (T262773: Stop archiving the wikidata-bugs mailinglist in pipermail). It's tiny on the filesystem, but migration and other work around it is a big mess.

The search index will be massive when we migrate everything :(

One silver lining is that many large mailing lists are large because they have lots of massive attachments, and hopefully those won't get indexed.

Okay, I made a "discovery" which I wish I hadn't. It's not fully clear to me yet, but I'm sure that with the migration to mailman3 we will save a lot of space.
First:

  • Mailman2 currently uses 175 GB. That's a lot.

Now:

  • For public mailing lists, you have the public archives:
list@lists1001$ du -hs /var/lib/mailman/archives/public/wikidata-bugs/
4.0G	/var/lib/mailman/archives/public/wikidata-bugs/
  • You also have its private archives, exactly the same:
list@lists1001$ du -hs /var/lib/mailman/archives/private/wikidata-bugs/
4.0G	/var/lib/mailman/archives/private/wikidata-bugs

Nope, they are not symlinks:

list@lists1001$ file /var/lib/mailman/archives/private/wikidata-bugs/2019-September.txt.gz
/var/lib/mailman/archives/private/wikidata-bugs/2019-September.txt.gz: gzip compressed data, was "2019-September.txt", last modified: Tue Oct  1 03:27:04 2019, max compression, original size 3583331
list@lists1001$ file /var/lib/mailman/archives/public/wikidata-bugs/2019-September.txt.gz
/var/lib/mailman/archives/public/wikidata-bugs/2019-September.txt.gz: gzip compressed data, was "2019-September.txt", last modified: Tue Oct  1 03:27:04 2019, max compression, original size 3583331

We have them repeated in the form of an mbox as well:

list@lists1001$ du -hs /var/lib/mailman/archives/private/wikidata-bugs.mbox
1.3G	/var/lib/mailman/archives/private/wikidata-bugs.mbox

Besides the mbox being repeated, each monthly archive is also stored twice over, as a text version and a gzipped version:

list@lists1001:/var/lib/mailman/archives/public/wikidata-bugs$ ls | head
2012-April
2012-April.txt
2012-April.txt.gz
2012-August
2012-August.txt
2012-August.txt.gz
2012-December
2012-December.txt
2012-December.txt.gz
2012-July

It gets better: most modern clients send an HTML version of the email as an attachment, meaning the email also gets repeated yet again, in the form of an HTML attachment.

list@lists1001$ du -hs /var/lib/mailman/archives/private/wikidata-bugs/attachments/
2.1G	/var/lib/mailman/archives/private/wikidata-bugs/attachments/

So if my calculations are correct, for each public mailing list with archives enabled, it keeps nine copies of each email: one mbox, two gzip, two HTML, two txt, two raw.
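In numbers, using just the du output above for wikidata-bugs:

# Back-of-the-envelope total in GiB; all figures are from the du output above.
mbox         = 1.3   # the raw archive itself
public_tree  = 4.0   # rendered pages + .txt + .txt.gz + HTML attachments
private_tree = 4.0   # byte-identical copy of the public tree (not symlinks)
print(f'{mbox + public_tree + private_tree:.1f} GiB')  # 9.3 GiB for ~1.3 GiB of mail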

One point: if we want to add Xapian indexing, it might cause storage issues on lists1001 during the upgrade, as lists1001 is already 65% full. Someone told me we can simply add more storage there, but I don't remember how.

I imported pywikibot-bugs. Its mbox is 290MB and has 50,058 emails.

Importing the subscribers etc. was easy and done in under a minute:

ladsgroup@lists1002:~$ time sudo mailman import21 pywikibot-bugs@lists-next.wikimedia.org pywikibot-bugs/config.pck

real	0m22.505s
user	0m20.314s
sys	0m0.307s

Importing the archives took half an hour and produced some warnings:

ladsgroup@lists1002:~$ time sudo django-admin hyperkitty_import -l pywikibot-bugs@lists-next.wikimedia.org pywikibot-bugs.mbox/pywikibot-bugs.mbox --pythonpath /usr/share/mailman3-web --settings settings
Importing from mbox file pywikibot-bugs.mbox/pywikibot-bugs.mbox to pywikibot-bugs@lists-next.wikimedia.org
5%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x90\\x8C\\xB0\\xF0\\x90...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
7%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x90\\x8C\\xB8\\xF0\\x90...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
31%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x95 i...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
31%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x84\\x0A\\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
33%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x89\\x0A\\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
80%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x88 :...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
81%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x83\\x0A\\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
84%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x99\\x84)....' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
85%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\xA9\\xB5\\x8F&p...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
85%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, 'Incorrect string value: \'\\xF0\\xA9\\xB5\\x8F",...\' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1')
  return self.cursor.execute(query, args)
88%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x8E\\x89 )...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
92%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x90\\x8D\\x86\\xF0\\x90...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
94%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\xA4\\xA8)\\xE2...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
95%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x89\\x0A ...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
98%/usr/lib/python3/dist-packages/django/db/backends/mysql/base.py:101: Warning: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x86 \\x0A...' for column `testmailman3web`.`hyperkitty_email`.`content` at row 1")
  return self.cursor.execute(query, args)
Computing thread structure
Synchronizing properties with Mailman
42 emails left to refresh, checked 0
Warming up cache
The full-text search index is not updated for this list. It will not be updated by the 'minutely' incremental update job. To update the index for this list, run the 'manage.py update_index_one_list pywikibot-bugs@lists-next.wikimedia.org' command.

real	31m1.486s
user	17m12.923s
sys	1m49.000s
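Those "Incorrect string value" warnings are worth decoding: every rejected sequence starts with byte 0xF0, i.e. a 4-byte UTF-8 character (emoji, Gothic letters), which MySQL's legacy 3-byte utf8 charset cannot store, so those characters get dropped from hyperkitty_email.content. A tiny sketch of the diagnosis; the usual fix (converting to utf8mb4) is noted as an assumption, not something that was run:

# Decode one of the rejected byte sequences from the warnings above.
seq = bytes.fromhex('F09F9895')
print(seq.decode('utf-8'))  # '😕' (U+1F615), a 4-byte UTF-8 sequence
print(len(seq))             # 4 bytes; legacy utf8 stores at most 3 per character
# Presumed fix, DBA territory:
#   ALTER TABLE hyperkitty_email CONVERT TO CHARACTER SET utf8mb4;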

Indexing them took ten minutes:

ladsgroup@lists1002:~$ time sudo django-admin update_index_one_list pywikibot-bugs@lists-next.wikimedia.org --pythonpath /usr/share/mailman3-web --settings settings
Indexing 50058 emails

real	9m53.242s
user	5m48.979s
sys	0m37.031s

The index basically exploded: it's now 790MB, and it used to be 60MB. But this mailing list is really heavy on text; hopefully mailing lists that are big mostly because of large attachments won't make such a big difference.

OTOH, the database is pretty nice: it's only 180MB. Maybe the DBAs can confirm the .ibd sizes on disk?

MariaDB [testmailman3web]> SELECT      table_name AS `Table`,      round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB`  FROM information_schema.TABLES order by `Size in MB` DESC limit 5;
+----------------------------+------------+
| Table                      | Size in MB |
+----------------------------+------------+
| hyperkitty_email           |     113.59 |
| hyperkitty_attachment      |      56.02 |
| hyperkitty_thread          |       6.89 |
| auth_user_user_permissions |       0.06 |
| auth_user_groups           |       0.06 |
+----------------------------+------------+
5 rows in set (0.009 sec)

Our mboxes are 45GB in total; if a 300MB mbox turns into a 180MB database (and a 600MB search index), I assume we will end up with a 27GB database and a 90GB search index (worst-case scenario). Maybe we can tune Xapian a bit?
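Spelled out, that extrapolation is just:

mboxes_total_gb = 45
db_gb    = mboxes_total_gb * 180 / 300  # 27.0 GB of database
index_gb = mboxes_total_gb * 600 / 300  # 90.0 GB of search index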

m5 went weeee during the import, but no alerts were triggered: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1128&var-port=9104&var-dc=eqiad%20prometheus%2Fops&from=1617751232456&to=1617755026685

So I attempted to import wikitech-l and it basically failed the moment I started it. Running cleanarch on the mbox fixed lots of issues, and the importer then managed to import around a couple of years of archives (https://lists-next.wikimedia.org/hyperkitty/list/wikitech-l@lists-next.wikimedia.org/thread/G4C7ZDU2G3ROJECRJESAYST7QD7SROCF/) before failing somewhere in 2004/2005. The --since option basically has no effect, as it fails while processing the mbox altogether. It also doesn't look like a ">From " escaping issue; it's an encoding issue, and trying to find and delete the offending line also failed miserably.

legoktm@lists1002:/home/ladsgroup$ ~/check_hk_import wikitech-l.mbox
Failed to convert message number 14531, message-id 200501042046.00692.yann@forget-me.net to bytes
    'ascii' codec can't encode character '\xa0' in position 143: ordinal not in range(128)
Processed 94380 messages.
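check_hk_import is a local helper script; presumably it boils down to re-serializing every message the way hyperkitty_import will, something like this reconstruction (not the actual script):

import mailbox
import sys

def check(path):
    processed = 0
    for i, msg in enumerate(mailbox.mbox(path)):
        try:
            msg.as_bytes()  # hyperkitty_import must re-serialize each message
        except Exception as e:
            print(f'Failed to convert message number {i}, '
                  f'message-id {msg.get("Message-Id", "<unknown>")} to bytes')
            print(f'    {e}')
        processed += 1
    print(f'Processed {processed} messages.')

if __name__ == '__main__':
    check(sys.argv[1])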

That's https://lists.wikimedia.org/pipermail/wikitech-l/2005-January/014563.html. I suppose we could manually edit the mbox to switch that mail from ISO-8859 to UTF-8? (tbh I am shocked that it only reported ONE failing message)

legoktm@lists1002:/home/ladsgroup$ diff wikitech-l.mbox.backup-lego wikitech-l.mbox
1070508c1070508
< 	charset="iso-8859-15"
---
> 	charset="utf-8"
1070530c1070530
< Wikinews, Wikispecies and Wikimedia Commons logos and URLs�?
---
> Wikinews, Wikispecies and Wikimedia Commons logos and URLs?
1070541c1070541
< http://fr.wikipedia.org/ | Encyclop�die libre
---
> http://fr.wikipedia.org/ | Encyclopedie libre

And with that it now passes check_hk_import, and theoretically @Ladsgroup you should be able to resume the import.

Oh thanks, you rock. Importing now.

Done:

ladsgroup@lists1002:~$ time sudo django-admin update_index_one_list wikitech-l@lists-next.wikimedia.org --pythonpath /usr/share/mailman3-web --settings settings
Indexing 94200 emails

real	33m4.736s
user	14m37.062s
sys	1m34.071s

Total index size:

3.0G	fulltext_xapian_index

mbox:

ladsgroup@lists1002:~$ ls -Ssh wikitech-l.mbox
346M wikitech-l.mbox

Db tables:

MariaDB [testmailman3web]> SELECT      table_name AS `Table`,      round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB`  FROM information_schema.TABLES order by `Size in MB` DESC limit 5;
+-------------------------------+------------+
| Table                         | Size in MB |
+-------------------------------+------------+
| hyperkitty_email              |     377.03 |
| hyperkitty_attachment         |      71.09 |
| hyperkitty_thread             |      14.39 |
| hyperkitty_sender             |       0.52 |
| socialaccount_socialapp_sites |       0.06 |
+-------------------------------+------------+
5 rows in set (0.009 sec)

Let's go with 500MB in total (if someone can get a correct number from the files, that'd be amazing). So 650MB of mbox turned into 500MB of database (and a 3GB search index), meaning the full database would be 34.6GB with a 212GB search index (worst-case scenario).

Ladsgroup claimed this task.

Let's call it resolved. Wikitech-l is one of our oldest and biggest mailing lists; if we managed to import that, we can probably do the rest.