
Phase #2: Convert archived Flow boards to wikitext
Open, Needs Triage, Public

Event Timeline

I have been thinking about writing a script to convert Flow pages to wikitext pages while retaining the history. That's still in the idea stage, though, and not something that will be ready for a while.

Code for that script is at https://gitlab.wikimedia.org/pppery/flow-export-with-history

Right now it only covers the header, but I tested it and it appears to work properly for that case.

I attempted to give it a try, but the very first line of the script threw

ModuleNotFoundError: No module named 'pymysql'

Could you please add a requirements.txt so that others can easily get started? (I didn’t open an MR because I don’t know which version you are using.) And maybe a .gitignore that ignores the venv directory (with a name of your choice), in case people don’t want to install packages globally.

I was developing it on PAWS, using whatever version of pymysql (and requests, which is the only other third-party package I used) is installed there, and following the instructions to connect to the wiki replicas at https://wikitech.wikimedia.org/wiki/News/2020_Wiki_Replicas_Redesign#How_should_I_connect_to_databases_in_PAWS%3F

There's also a bunch of hardcoded stuff there - the code is still very WIP as I said. You're still welcome to try it, though, and PRs to make the code less of a personal hack are welcome.

That script is now mostly done in terms of features. It converts the entire board and its history to wikitext.

Looking at the PAWS repo, they don’t seem to set any explicit dependency versions (I found pymysql in https://github.com/toolforge/paws/blob/main/images/singleuser/requirements.txt), so I just added them without any version constraints in !2. I mostly got it to work locally, except that the very first page I tried to convert, Talk:Wikidata Bridge/Flow, failed with

Traceback (most recent call last):
  File ".../flow-export-with-history/script.py", line 522, in <module>
    main(sys.argv)
  File ".../flow-export-with-history/script.py", line 519, in main
    revs = convertBoard(page)
           ^^^^^^^^^^^^^^^^^^
  File ".../flow-export-with-history/script.py", line 505, in convertBoard
    revisions = mergeRevisions(revisions, convertTopic(root), revs1default=default)
                                          ^^^^^^^^^^^^^^^^^^
  File ".../flow-export-with-history/script.py", line 200, in convertTopic
    assert postID in blocks, ("%s should be in %s" % (postID, blocks))
           ^^^^^^^^^^^^^^^^
AssertionError: v369okc3k7nemhq9 should be in {'v2wpbgkiww7jwa16': {'body': '', 'children': [], 'signature': ''}}

(Root path redacted from the stack trace; I added the message to the assert statement but haven’t committed it because I don’t know if it’s useful in general.) Converting Project talk:Village Pump/Flow, on the other hand, went well. (The resulting XML mixes namespaced and non-namespaced elements, but MediaWiki seems to accept this mess when importing.)

The problem there is that https://www.mediawiki.org/w/index.php?title=Topic:V2wpbgkiww7jwa16 has more than 50 changes in its history, which my history-reading code didn't handle.

Unfortunately the Flow API doesn't seem to support continuation cleanly. I just used an undocumented hack to up the limit to 5000 actions in open topic, so it now works. Carry on ...

The resulting XML mixes namespaced and non-namespaced elements, but MediaWiki seems to accept this mess when importing

The reason for that is that I do an export of the page to get a skeleton, which has namespaced elements, and then I modify it by adding non-namespaced elements. It's also missing some fields that a proper XML dump would have, like revision SHA1s. But I tested it and confirmed it imports how I want.

The problem there is that https://www.mediawiki.org/w/index.php?title=Topic:V2wpbgkiww7jwa16 has more than 50 changes in its history, which my history-reading code didn't handle.

Unfortunately the Flow API doesn't seem to support continuation cleanly. I just used an undocumented hack to up the limit to 5000 actions in open topic, so it now works. Carry on ...

So this means that it’ll still break in case there are over 5000 changes? Quite unlikely, but not impossible in very long/heated topics.

Also: I tried it out on Talk:Wikidata_Bridge/Flow, and it went well – except that it rejected the import because some revisions exceeded the $wgMaxArticleSize of 20 kiB. And that of 50 kiB. And that of 100 kiB. In the end, I had to increase the max size to 150 kiB, or 6.5 times more than the default. (And import from the command line, because Special:Import crashed with an out-of-memory error.) Such an excessive increase definitely won’t happen in production, so the script should probably split large pages into multiple archives. (I don’t know what to do if a single topic exceeds the limit…)

The resulting XML mixes namespaced and non-namespaced elements, but MediaWiki seems to accept this mess when importing

The reason for that is that I do an export of the page to get a skeleton, which has namespaced elements, and then I modify it by adding non-namespaced elements.

…and it looks like xml.etree.ElementTree can’t actually export trees with a default namespace – I tried to add the namespace to all elements that our script adds and then specify default_namespace in the tostring() call, and it still complained about some version not having a namespace. That is probably the version attribute of the root element, which wouldn’t need any namespace (since attributes by default inherit the namespaces of their elements).
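One workaround that might apply here, sketched under assumptions: instead of passing default_namespace to tostring(), register the export namespace under the empty prefix with ET.register_namespace(). Every added element then needs a Clark-notation tag ({uri}name), while attributes such as version stay unqualified. The namespace URI, the helper names q() and serialize(), and the sample skeleton below are placeholders for illustration and are not taken from the script.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; use whatever the exported skeleton actually declares.
EXPORT_NS = "http://www.mediawiki.org/xml/export-0.11/"


def q(tag: str) -> str:
    """Return the tag in Clark notation so it lives in the export namespace."""
    return "{%s}%s" % (EXPORT_NS, tag)


def serialize(root: ET.Element) -> str:
    # Mapping the empty prefix to the URI makes the serializer emit
    # xmlns="..." on the root and write namespaced tags without a prefix.
    ET.register_namespace("", EXPORT_NS)
    return ET.tostring(root, encoding="unicode")


# Skeleton as it might come back from an export (parsed tags are namespaced).
skeleton = ET.fromstring(
    '<mediawiki xmlns="%s" version="0.11">'
    "<page><title>Example</title></page></mediawiki>" % EXPORT_NS
)
page = skeleton.find(q("page"))

# A revision added by hand: qualify every element, leave attributes alone.
revision = ET.SubElement(page, q("revision"))
ET.SubElement(revision, q("text"), {"bytes": "5"}).text = "Hello"

xml_out = serialize(skeleton)
```

Note that register_namespace() changes module-wide state, so it affects every later serialization in the same process.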

So this means that it’ll still break in case there are over 5000 changes? Quite unlikely, but not impossible in very long/heated topics.

Yes, but I'd prefer to cross that bridge if it actually happens.
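Until that bridge needs crossing, a cheap safeguard would be to fail loudly when the number of returned changes reaches the cap, since that is the only case in which history could have been silently truncated. This is only a sketch and not what the script does; the function name and the constant are hypothetical, and how the limit is passed to the Flow API is deliberately left out.

```python
HISTORY_LIMIT = 5000  # cap mentioned above; the API call itself is not shown


def check_not_truncated(topic_id: str, changes: list, limit: int = HISTORY_LIMIT) -> list:
    """Return changes unchanged, or raise if the result may have been cut off."""
    if len(changes) >= limit:
        raise RuntimeError(
            "Topic %s returned %d changes (limit %d); history may be truncated"
            % (topic_id, len(changes), limit)
        )
    return changes
```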

In the end, I had to increase the max size to 150 kiB, or 6.5 times more than the default

Isn't $wgMaxArticleSize two megabytes by default and in production?

https://www.mediawiki.org/wiki/Manual:$wgMaxArticleSize:

Maximum page size in kibibytes.
Default value: 2048

2048 kibibytes is 2 mebibytes.
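Because the setting is expressed in kibibytes, a pre-import check has to multiply by 1024 before comparing. A minimal sketch, assuming the limit applies to the UTF-8 byte length of a revision's wikitext (the helper name is made up for illustration):

```python
def exceeds_max_article_size(wikitext: str, max_kib: int = 2048) -> bool:
    """True if the UTF-8 encoded text is larger than max_kib kibibytes."""
    return len(wikitext.encode("utf-8")) > max_kib * 1024
```

With max_kib=20, the threshold would be 20 * 1024 = 20480 bytes; with the documented default of 2048, it is 2048 * 1024 = 2097152 bytes.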

I do still need to find some way of splitting large pages (since I'm sure that the support desk is bordering on gigabytes of raw text). Probably the best approach is to split topics based on their last modification date, but it's possible that there are so many topics with large activity periods that that will still be too big.

Isn't $wgMaxArticleSize two megabytes by default and in production?

Despite what the documentation says, it defaults to 20 KB on my localhost.

Maybe Docker is forcing the loading of includes/DevelopmentSettings.php without telling me. I see $wgMaxArticleSize = 20; in includes/DevelopmentSettings.php.

Yeah, that seems plausible.

I don't have a MediaWiki instance set up locally at the moment so can't confirm. Anyway this is not a bug in my script, just a reminder that talk pages can get large easily.

So this means that it’ll still break in case there are over 5000 changes? Quite unlikely, but not impossible in very long/heated topics.

Yes, but I'd prefer to cross that bridge if it actually happens.

Okay, let’s see.

Maybe Docker is forcing the loading of includes/DevelopmentSettings.php without telling me. I see $wgMaxArticleSize = 20; in includes/DevelopmentSettings.php.

I also use Docker. I was aware that DevelopmentSettings.php is automatically used in Docker (apparently via a PlatformSettings.php file, which in turn is included by the auto-generated LocalSettings.php), but I would’ve never thought that DevelopmentSettings.php changes the default max article size. (It was added in rMWc25380ca37b714af5fe4fadb1d4f0aaebde38180 – for tests, not for dev environments.)

Anyway this is not a bug in my script, just a reminder that talk pages can get large easily.

Of course it’s not a bug – it’s a lack of a feature. While this page is well below the 2 MiB limit, pages like the village pump will probably exceed it by far. (I tried to run the script on Project:Village_Pump/Flow, but it threw RuntimeError: Found suspicious early history in /LQT Archive 1, so I don’t know how big it actually is.)

(I tried to run the script on Project:Village_Pump/Flow, but it threw RuntimeError: Found suspicious early history in /LQT Archive 1, so I don’t know how big it actually is.)

That should have been a warning, not an error, and I've updated the repo to change it to one. (The cause is that it tries to merge the history of https://www.mediawiki.org/wiki/Project:Village_pump/LQT_Archive_1 with the history of the Flow header, and expected that page to always have {{#useliquidthreads:1}}, which it didn't due to IP vandalism.)

I fixed that, which revealed a similar false-positive LQT paranoia check, and then another place where someone did something weird in the LQT era in 2011 that my code didn't expect and that I had to work around. Then it finally proceeded ... until it crashed my PAWS kernel for (I presume) using too much memory. So very large indeed.


We have to choose a paradigm for how to split the history. There are several possibilities:

  • Split each topic into its own page
  • Simulate cut-and-paste archiving by creating edits that remove topics the month (or year, or week, or other time period) after the last time they've been touched.
  • Simulate something similar to the way Wiktionary does things like https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour - put all of the topics that were created in a given month (for example) on one page, and let the history of that page include all of the changes made to those topics even after that month is over.

I'm inclined to do the third, as it's the least awkward and avoids introducing arbitrary archive point thresholds, even though it differs from the way wikitext discussions are typically archived.
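The third option can be sketched roughly as follows: topics are bucketed by the month in which they were created, and each bucket becomes one archive page whose history would then carry every later change to those topics. The Topic record, its field names, and the page naming scheme are hypothetical and only serve to illustrate the grouping; the actual script's data model may look quite different.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Topic:
    # Hypothetical record; the real script's data model may differ.
    title: str
    created: datetime


def archive_title(board: str, created: datetime) -> str:
    """Illustrative naming scheme: one subpage per creation month."""
    return f"{board}/{created:%Y/%m}"


def group_by_creation_month(board: str, topics: list[Topic]) -> dict[str, list[Topic]]:
    """Map each archive page title to its topics, oldest topic first."""
    buckets: dict[str, list[Topic]] = defaultdict(list)
    for topic in sorted(topics, key=lambda t: t.created):
        buckets[archive_title(board, topic.created)].append(topic)
    return dict(buckets)
```

Switching the granularity to years (or weeks) would only require changing the format string in archive_title().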

Many wikis have discussions split by month (or by three months/half a year), but the archives of the English Wikipedia Village Pump and Administrators' noticeboard are split by constant size.

I'm in the process of updating the code to optionally split topics by the year they were first created (the third option). This produces a reasonable output size for the village pump, but the support desk is still far too large and would need its history split by months to be at a reasonable size.

The code now supports passing a year parameter to only export the history of topics that started in one specific year.

I've spent the last few weeks ironing out the kinks in that script (some of which are listed at https://www.mediawiki.org/wiki/User:Flow_cleanup_bot), and have gotten to the point that I'm ready to do the mass run on MediaWiki.org after a final few weeks for last-minute comments.

After I run it then, any other wiki is welcome to ask me to run the script (which requires the bot be granted admin and XML importer rights), or they're welcome to not do so and let the WMF do whatever it's going to do.

Sometime much later I plan to write and run another bot to fix "Topic:Foo" links to their appropriate location in the converted Flow page. That's still in the conceptual stage, though.

@Pppery great work so far. Thank you so much!

I have now done the main bot run on MediaWiki.org. I'm currently manually processing pages that are too large to import at once, or have broken syntax, or have some other problem that the script deferred for manual processing.

Not Flow cleanup bot's fault - you can see three barnstars at https://www.mediawiki.org/wiki/User_talk:IKhitron/Flow too. I just faithfully represented what happened.

Wow. My bad. So I have six now. Didn't know you preserve the originals, so I could check.

That manual processing is now finished. So at this point I've exported every Flow board on MediaWiki.org (although it's vaguely possible one got missed in the shuffle somehow ...)

Something that’s occurred to me is that - if the history of Flow boards isn’t preserved/made accessible somewhere (either through a wikitext conversion, or through something like T389680) - there’s a real risk that some topics/posts that are no longer immediately visible on a Flow board (e.g., that were posted and then removed/'hidden' by a non-admin) will be made much harder to access. (Comment updated after @Pppery pointed out the Flow dumps below, ty!)

The example that’s in my mind right now is the case of user talk pages — it’s my experience that editors are generally permitted to remove posts from their own user talk page after they’ve read them. Therefore, if the history/content of previous (hidden) Flow revisions isn’t preserved somehow, any messages that have been left for a user and then 'hidden' by that user after-the-fact (as an acceptable way of clearing their talk page) could potentially be lost forever without dealing with historical Flow dumps. (In a potential/hypothetical case where a user removes/'hides' all messages posted to their talk page as a matter of practice, this could result in the loss of all previous messages left on this user’s talk page.)

Flow cleanup bot exports, of course, include history, including hidden comments (but not deleted comments). Probably nothing else will - my idea for T389680 was to naively screenscrape all of the 100,000 topics as viewed by a logged-out user, which would not include the hidden comments. Of course, a slightly smarter screenscrape could be done there, but I'm inclined to not care.

(The content won't truly be lost since Flow has good dump coverage, but that's little help here)

The comment above conflated hidden *topics* and hidden *posts* - hidden posts are included in Flow cleanup bot exports, and won't be included in static-flow; hidden topics aren't included in Flow cleanup bot exports (because the Flow API seems to provide no way of listing them - there isn't an API for "view board history" and the view-topiclist API doesn't include them) but will be included in static-flow if you guess the permalink somehow. There will be no way of finding them, though, other than digging through dumps. All of this is unideal, but I've basically expended all of my motivation to deal with these edge cases by now.

(For the record, I did later dig through the dumps and add hidden topics on MediaWiki.org. The process is tedious and very manual, though, so it probably won't be done for other wikis unless someone really cares. So I think after way too much effort MediaWiki.org is as done as it's going to be - but T389680 should still be done before any actual undeployments)