Phase #2: Convert archived Flow boards to wikitext
Open, Needs TriagePublic

Assigned To

None

Authored By

	ppelberg
	Oct 12 2024, 12:06 AM

Related Objects

Status	Assigned	Task
Open	None	T335670 Disable Flow/Structured Discussions from all namespaces, except for user talk pages, on Catalan Wikipedia
Open	None	T106123 Extensions needing to be removed from Wikimedia wikis
Open	None	T332022 [Epic] Undeploying StructuredDiscussions (Flow)
Open	None	T377051 Phase #2: Convert archived Flow boards to wikitext

Event Timeline

ppelberg created this task. · Oct 12 2024, 12:06 AM

Restricted Application added a project: Growth-Team. · Oct 12 2024, 12:06 AM

ppelberg mentioned this in T332022: [Epic] Undeploying StructuredDiscussions (Flow). · Oct 12 2024, 12:06 AM

KStoller-WMF moved this task from Inbox to Triaged on the Growth-Team board. · Oct 14 2024, 1:46 AM

ppelberg mentioned this in T370722: Set Flow and LQT sunsetting timeline and sequence. · Nov 12 2024, 7:43 PM

I have been thinking about writing a script to convert Flow pages to wikitext pages while retaining the history. That's still in the idea stage, though, and not something that will be ready for a while.

Code for that script is at https://gitlab.wikimedia.org/pppery/flow-export-with-history

Right now it only covers the header, but I tested it and it appears to work properly for that case.

I attempted to give it a try, but the very first line of the script threw

ModuleNotFoundError: No module named 'pymysql'

Could you please add a requirements.txt so that others can easily get started? (I didn’t open an MR because I don’t know which version you are using.) And maybe a .gitignore that ignores the venv directory (with a name of your choice), in case people don’t want to install packages globally.

I was developing it on PAWS, using whatever version of pymysql (and requests, which is the only other third-party package I used) is installed there, and following the instructions to connect to the wiki replicas at https://wikitech.wikimedia.org/wiki/News/2020_Wiki_Replicas_Redesign#How_should_I_connect_to_databases_in_PAWS%3F
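For anyone reproducing this setup, the connection step described on that Wikitech page can be sketched roughly as below. The helper names `replica_connect_kwargs` and `connect`, the default wiki `mediawikiwiki`, and the host pattern are assumptions made for illustration; they are not taken from the script and should be checked against the linked instructions.

```python
import os


def replica_connect_kwargs(wiki="mediawikiwiki", cluster="analytics"):
    """Build keyword arguments for pymysql.connect() (hypothetical helper).

    Assumes the replica host pattern <wiki>.<cluster>.db.svc.wikimedia.cloud
    and credentials stored in ~/.my.cnf, as is the case on PAWS.
    """
    return {
        "host": f"{wiki}.{cluster}.db.svc.wikimedia.cloud",
        "database": f"{wiki}_p",
        "read_default_file": os.path.expanduser("~/.my.cnf"),
    }


def connect(wiki="mediawikiwiki"):
    # Imported lazily so the helper above can be used without pymysql installed.
    import pymysql

    return pymysql.connect(**replica_connect_kwargs(wiki))
```

Outside PAWS the same arguments would only work through an SSH tunnel or similar, so a local run probably needs a different host and credentials file.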

There's also a bunch of hardcoded stuff there - the code is still very WIP as I said. You're still welcome to try it, though, and PRs to make the code less of a personal hack are welcome.

That script is now mostly done in terms of features. It converts the entire board and its history to wikitext.

Looking at the PAWS repo, they don’t seem to set any explicit dependency versions (I found pymysql in https://github.com/toolforge/paws/blob/main/images/singleuser/requirements.txt), so I just added them without any version constraints in !2. I mostly got it to work locally, except that the very first page I tried to convert, Talk:Wikidata Bridge/Flow, failed with

Traceback (most recent call last):
  File ".../flow-export-with-history/script.py", line 522, in <module>
    main(sys.argv)
  File ".../flow-export-with-history/script.py", line 519, in main
    revs = convertBoard(page)
           ^^^^^^^^^^^^^^^^^^
  File ".../flow-export-with-history/script.py", line 505, in convertBoard
    revisions = mergeRevisions(revisions, convertTopic(root), revs1default=default)
                                          ^^^^^^^^^^^^^^^^^^
  File ".../flow-export-with-history/script.py", line 200, in convertTopic
    assert postID in blocks, ("%s should be in %s" % (postID, blocks))
           ^^^^^^^^^^^^^^^^
AssertionError: v369okc3k7nemhq9 should be in {'v2wpbgkiww7jwa16': {'body': '', 'children': [], 'signature': ''}}

(Root path redacted from the stack trace; I added the message to the assert statement but haven’t committed it because I don’t know if it’s useful in general.) Converting Project talk:Village Pump/Flow, on the other hand, went well. (The resulting XML mixes namespaced and non-namespaced elements, but MediaWiki seems to accept this mess when importing.)

The problem there is that https://www.mediawiki.org/w/index.php?title=Topic:V2wpbgkiww7jwa16 has more than 50 changes, which my history-reading code didn't handle.

Unfortunately the Flow API doesn't seem to support continuation cleanly. I just used an undocumented hack to up the limit to 5000 actions in open topic, so it now works. Carry on ...
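Since the raised limit only moves the cutoff, one option is to fail loudly when a topic's history reaches it instead of silently converting a truncated history. A minimal sketch, assuming the fetched changes arrive as a list; the function name `ensure_history_complete` is hypothetical and not part of the script:

```python
def ensure_history_complete(changes, limit=5000):
    """Return the changes unchanged, or raise if they may have been cut off.

    If the API returned as many entries as the requested limit, there is no
    way to tell whether more exist, so treat that case as an error.
    """
    if len(changes) >= limit:
        raise RuntimeError(
            f"Got {len(changes)} changes with limit {limit}; "
            "history may be truncated"
        )
    return changes
```

A topic with exactly `limit` changes would be rejected as well, which seems an acceptable price for never dropping history unnoticed.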

The resulting XML mixes namespaced and non-namespaced elements, but MediaWiki seems to accept this mess when importing

The reason for that is that I do an export of the page to get a skeleton, which has namespaced elements, and then I modify it by adding non-namespaced elements. It's also missing some fields that a proper XML dump would have, like revision SHA1s. But I tested it and confirmed it imports how I want.

In T377051#10422932, @Pppery wrote:

The problem there is that https://www.mediawiki.org/w/index.php?title=Topic:V2wpbgkiww7jwa16 has more than 50 changes, which my history-reading code didn't handle.

Unfortunately the Flow API doesn't seem to support continuation cleanly. I just used an undocumented hack to up the limit to 5000 actions in open topic, so it now works. Carry on ...

So this means that it’ll still break in case there are over 5000 changes? Quite unlikely, but not impossible in very long/heated topics.

Also: I tried it out on Talk:Wikidata_Bridge/Flow, and it went well – except that the import was rejected because some revisions exceeded the $wgMaxArticleSize of 20 kiB. And that of 50 kiB. And that of 100 kiB. In the end, I had to increase the max size to 150 kiB, or 6.5 times more than the default. (And import from the command line, because Special:Import crashed with an out-of-memory error.) An increase this large definitely won’t happen in production, so the script should probably split large pages into multiple archives. (I don’t know what to do if a single topic exceeds the limit…)

The resulting XML mixes namespaced and non-namespaced elements, but MediaWiki seems to accept this mess when importing

The reason for that is that I do an export of the page to get a skeleton, which has namespaced elements, and then I modify it by adding non-namespaced elements.

…and it looks like xml.etree.ElementTree can’t actually export trees with a default namespace – I tried adding the namespace to all elements that our script adds and then specifying default_namespace in the tostring() call, and it still complained about some name not having a namespace. That name is probably the version attribute of the root element, which shouldn’t need any namespace (unprefixed attributes are not in any namespace; they are interpreted in the context of their element).
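A possible workaround, sketched here under assumptions rather than taken from the script: registering the empty prefix for the namespace with `ET.register_namespace` lets `tostring()` write `xmlns="…"` on the root without the `default_namespace` option, so an unqualified `version` attribute is not rejected. The namespace URI and the tiny tree are illustrative only, and `register_namespace` changes module-wide state.

```python
import xml.etree.ElementTree as ET

# Assumed export namespace, used here for illustration only.
NS = "http://www.mediawiki.org/xml/export-0.11/"


def build_tree():
    """Build a tiny dump-like tree with fully qualified element names."""
    root = ET.Element(f"{{{NS}}}mediawiki", {"version": "0.11"})
    page = ET.SubElement(root, f"{{{NS}}}page")
    ET.SubElement(page, f"{{{NS}}}title").text = "Example"
    return root


def serialize_with_default_namespace(root):
    """Try the default_namespace option; return None if ElementTree rejects it."""
    try:
        return ET.tostring(root, encoding="unicode", default_namespace=NS)
    except ValueError:
        # Raised for non-qualified names, which includes plain attributes.
        return None


def serialize_with_registered_prefix(root):
    """Serialize after mapping the namespace to the empty prefix."""
    ET.register_namespace("", NS)  # global side effect on the ET module
    return ET.tostring(root, encoding="unicode")
```

Calling `serialize_with_registered_prefix(build_tree())` should yield a document whose elements carry no `ns0:` prefix while the `version` attribute stays unqualified.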

So this means that it’ll still break in case there are over 5000 changes? Quite unlikely, but not impossible in very long/heated topics.

Yes, but I'd prefer to cross that bridge if it actually happens.

In the end, I had to increase the max size to 150 kiB, or 6.5 times more than the default

Isn't $wgMaxArticleSize two megabytes by default and in production?

https://www.mediawiki.org/wiki/Manual:$wgMaxArticleSize:

Maximum page size in kibibytes.
Default value: 2048

2048 kibibytes is 2 mebibytes.
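To make the unit arithmetic concrete: the setting is expressed in kibibytes, so a limit of 20 means 20 × 1024 = 20,480 bytes and the documented default of 2048 means 2,097,152 bytes. A small sketch of a pre-import size check, assuming the limit is compared against the UTF-8 byte length of the revision text; the function name is hypothetical:

```python
def exceeds_max_article_size(text, max_kib=2048):
    """Return True if the UTF-8 encoded text is larger than max_kib KiB."""
    return len(text.encode("utf-8")) > max_kib * 1024
```

Note that non-ASCII characters count for more than one byte, so a page can exceed the limit with fewer characters than the byte figure suggests.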

I do still need to find some way of splitting large pages (since I'm sure that the support desk is bordering on gigabytes of raw text). Probably the best approach is to split topics based on their last modification date, but it's possible that there are so many topics with large activity periods that that will still be too big.

In T377051#10423162, @Pppery wrote:

Isn't $wgMaxArticleSize two megabytes by default and in production?

Despite what the documentation says, it defaults to 20 KB in my localhost.

Maybe Docker is forcing the loading of includes/DevelopmentSettings.php without telling me. I see $wgMaxArticleSize = 20; in includes/DevelopmentSettings.php.

Yeah, that seems plausible.

I don't have a MediaWiki instance set up locally at the moment so can't confirm. Anyway this is not a bug in my script, just a reminder that talk pages can get large easily.

In T377051#10423162, @Pppery wrote:

So this means that it’ll still break in case there are over 5000 changes? Quite unlikely, but not impossible in very long/heated topics.

Yes, but I'd prefer to cross that bridge if it actually happens.

Okay, let’s see.

In T377051#10423192, @Novem_Linguae wrote:

Maybe Docker is forcing the loading of includes/DevelopmentSettings.php without telling me. I see $wgMaxArticleSize = 20; in includes/DevelopmentSettings.php.

I also use Docker. I was aware that DevelopmentSettings.php is automatically used in Docker (apparently via a PlatformSettings.php file, which in turn is included by the auto-generated LocalSettings.php), but I would’ve never thought that DevelopmentSettings.php changes the default max article size. (It was added in rMWc25380ca37b714af5fe4fadb1d4f0aaebde38180 – for tests, not for dev environments.)

In T377051#10423239, @Pppery wrote:

Anyway this is not a bug in my script, just a reminder that talk pages can get large easily.

Of course it’s not a bug – it’s a lack of a feature. While this page is well below the 2 MiB limit, pages like the village pump will probably exceed it by far. (I tried to run the script on Project:Village_Pump/Flow, but it threw RuntimeError: Found suspicious early history in /LQT Archive 1, so I don’t know how big it actually is.)

(I tried to run the script on Project:Village_Pump/Flow, but it threw RuntimeError: Found suspicious early history in /LQT Archive 1, so I don’t know how big it actually is.)

That should have been a warning, not an error, and I've updated the repo to change it to one. (The cause is that it tries to merge the history of https://www.mediawiki.org/wiki/Project:Village_pump/LQT_Archive_1 with the history of the Flow header, and expected that page to always have {{#useliquidthreads:1}}, which it didn't due to IP vandalism.)
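The difference between the two behaviours can be sketched as follows. This is an illustration only, not the script's actual check: the function name `check_lqt_marker` and its arguments are hypothetical, and the real condition may be more involved.

```python
import warnings

LQT_MARKER = "{{#useliquidthreads:1}}"


def check_lqt_marker(title, wikitext):
    """Warn (instead of raising) when an old revision lacks the LQT marker.

    Returns True if the marker is present, False otherwise, so the caller
    can keep converting and decide later what to do with odd revisions.
    """
    if LQT_MARKER not in wikitext:
        warnings.warn(
            f"Found suspicious early history in {title}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True
```

With a warning, a single vandalised revision no longer aborts the whole conversion, but it still shows up in the output for manual review.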

I fixed that, which revealed a similar false-positive LQT paranoia check, and then another place where someone did something weird in the LQT era in 2011 that my code didn't expect and that I had to work around. Then it finally proceeded ... until it crashed my PAWS kernel, (I presume) for using too much memory. So very large indeed.

We have to choose a paradigm for how to split the history. There are several possibilities:

1. Split each topic into its own page.
2. Simulate cut-and-paste archiving by creating edits that remove topics the month (or year, or week, or other time period) after the last time they've been touched.
3. Simulate something similar to the way Wiktionary does things like https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour - put all of the topics that were created in a given month (for example) on one page, and let the history of that page include all of the changes made to those topics even after that month is over.

I'm inclined to do the third, as it is the least awkward and avoids introducing arbitrary archive point thresholds, even though it differs from the way wikitext discussions are typically archived.
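The third option amounts to bucketing topics by the month they were created in, regardless of when they were last edited. A minimal sketch, assuming each topic is available as a (title, creation datetime) pair; the function name and the `YYYY/MM` subpage naming are hypothetical:

```python
from collections import defaultdict
from datetime import datetime


def group_topics_by_creation_month(topics):
    """Map a "YYYY/MM" key to the topic titles created in that month.

    `topics` is an iterable of (title, created) pairs, where `created` is a
    datetime. Topics are kept in creation order within each month.
    """
    pages = defaultdict(list)
    for title, created in sorted(topics, key=lambda item: item[1]):
        pages[created.strftime("%Y/%m")].append(title)
    return dict(pages)
```

For example, two topics created in March 2016 and one created in April 2016 would end up on two pages, and every later edit to a March topic would still land in the history of the March page.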

Many wikis have discussions split by month (or every three or six months), but the archives of the English Wikipedia Village Pump and Administrators' noticeboard are split at a constant size.
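For comparison, splitting at a constant size can be sketched as a greedy pass over the topics in chronological order. This is only an illustration with a hypothetical function name; it measures the final wikitext of each topic and ignores page history, which the real script would also have to account for.

```python
def split_by_size(topics, max_bytes):
    """Greedily pack (title, wikitext) pairs into archives of at most max_bytes.

    A new archive is started whenever adding the next topic would exceed the
    limit. A single topic larger than max_bytes still gets an archive of its
    own, so that case remains unsolved here.
    """
    archives, current, current_size = [], [], 0
    for title, text in topics:
        size = len(text.encode("utf-8"))
        if current and current_size + size > max_bytes:
            archives.append(current)
            current, current_size = [], 0
        current.append(title)
        current_size += size
    if current:
        archives.append(current)
    return archives
```

Unlike the month-based grouping, the archive boundaries here depend on how long the neighbouring topics are, so they move if the conversion output changes.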
