
Inform ptwikibooks of LQT and Flow removal plan and timing
Closed, Resolved (Public)

Authored By
Trizek-WMF
Mar 6 2025, 9:20 AM
Referenced Files
F65817508: lqt.json.bz2
Aug 21 2025, 1:45 PM
F62352975: proposed-redirects.json
Jun 16 2025, 5:51 PM
F62352467: redirects.json.bz2
Jun 16 2025, 4:45 PM
F62314766: proposed-redirects.json
Jun 13 2025, 5:13 PM
F62314410: proposed-redirects.json
Jun 13 2025, 4:57 PM
F62305157: topics.txt
Jun 12 2025, 6:12 PM
F62305155: extract-topics.ipynb
Jun 12 2025, 6:12 PM
F62304618: proposed-redirects.json.bz2
Jun 12 2025, 4:58 PM

Description

Portuguese Wikibooks has Flow and LiquidThreads installed.

We will deprecate LQT by:

  • Repairing redirects to threads which had already been ported to Flow in 2015
  • Restoring LQT boards where LQT had been disabled in the wikitext
  • Porting the pages to Flow
  • Archiving those Flow pages and switching to Structured Discussions.
Steps:

Related Objects


Event Timeline

There are a very large number of changes, so older changes are hidden.

The LQT->LQT redirects in that list are fine; they are what normally happens when a thread is moved in LQT. The one LQT->Flow redirect was a case I hadn't foreseen, where double-redirect bots turned the previous case into a redirect to an invalid Flow topic. Since there was only one of them, I undid the bot edit and returned it to the previous state; the topic in question is on a page that remains LQT today.

On the other hand, the very second entry of your list:

{
  "title": "Tópico:Wikilivros:Diálogos comunitários/Importar w:Encadernação artesanal",
  "to": "Topic:Pi71hsicrr6udelr"
},

That should have been skipped because https://pt.wikibooks.org/wiki/Topic:Pi71hsicrr6udelr isn't a valid Flow topic. I don't see any code to check that there.

That specific page is a nasty one because it has two topic redirects in its history, both of which are invalid. It's hard to tell because Flow's support for finding old topics is awful, but I don't think that topic was ever imported.

For now I'm just grabbing the list of redirects, valid or otherwise – I did indeed realise as soon as I'd posted that I should have phrased that better. Processing every revision of everything in a namespace was the thing I really wanted behind me, and now I've got a JSON file of Flow topics that should load nice and quickly. It's the end of my day now, but hopefully by the end of tomorrow I'll have a list mapping each LQT page to the most recent valid Flow topic, if one exists.

Here we go, a list of topics which exist:


(Sorry, but ...)

Not all Flow topics necessarily appear in the "flow" dump. Topics that were "hidden" are only found in the FlowHistory dump. I know this because a variant of this exact issue bit me when running Flow cleanup bot - see https://www.mediawiki.org/wiki/User_talk:Pppery#'Hidden'_topics

(I'm not sure whether topics that were deleted by a local admin appear in the FlowHistory dumps but even if they don't you probably don't care)

Oh, and that reminded me of another test case: Make sure https://pt.wikibooks.org/wiki/T%C3%B3pico:Utilizador_Discuss%C3%A3o:Helder.wiki/Re:_T%C3%B3picos_sobre_a_p%C3%A1gina_%22Wikip%C3%A9dia:Dicion%C3%A1rio/pt-AO%22 gets correctly pointed to https://pt.wikibooks.org/wiki/Topic:Pv0t4gx9sfpb67yc not https://pt.wikibooks.org/wiki/Topic:Pv0t4h52glek57o8 - there are about 10 of these manually-deleted cases, which would appear in the Flow APIs (and maybe the Flow dumps too - unclear) differently from the later script-deleted cases.

Hah, don't apologise – a lot of why I'm posting in so much detail to this thread is that it's important to get this right and it's reassuring that you're here to point me in the right direction. I'll run the other dump through this script tomorrow, with modifications as needed, and hopefully by the time you're online for the day I'll have a relatively convincing list of threads and the topics they should point at.

Right then, here we go!

import json
import bz2
import re

with bz2.open( 'redirects.json.bz2' ) as file:
    redirects = json.load( file )

with open( 'topics-history.txt' ) as file:
    topics = set( file.read().splitlines() )

# Strip the 'Topic:' namespace and '#anchor' suffixes to extract the database ID of a topic
def normaliseTopicTitle( topic ):
    match = re.match( r'Topic:([a-z0-9]*)', topic, re.I )
    return match[ 1 ].lower()

# Given a page entry, return the list of redirects into the `Topic:` namespace
def extractTopic( entry ):
    redirects = entry[ 'redirects' ]
    matches = filter( None, [ re.match( r'(Topic:[a-z0-9#]*)', redirect, re.I ) for redirect in redirects ] )
    return [ match[ 1 ] for match in matches ]

# Update a page entry, ensuring that the redirects list consists only of _valid_ redirects into Topic: namespace
def updatedRedirects( page ):
    result = page.copy()
    result[ 'redirects' ] = [ topic for topic in extractTopic( page ) if normaliseTopicTitle( topic ) in topics ]
    return result

# Filter redirects.json.bz2 for pages which once had a valid redirect, and do not currently have an active redirect
valid_redirects = [ updatedRedirects( redirect ) for redirect in redirects if redirect[ 'active' ] is None ]
entries = [ { 'title': redirect[ 'title' ], 'to': redirect[ 'redirects' ][ 0 ] }  for redirect in valid_redirects if len( redirect[ 'redirects' ] ) ]

with open( 'proposed-redirects.json', 'w' ) as file:
    json.dump( entries, file, ensure_ascii=False, indent=2 )

Here redirects.json.bz2 was generated by the script in an earlier comment above, using the output file similarly included above, with entries such as:

{
  "title": "Tópico:Wikilivros:Diálogos comunitários/Importar w:Encadernação artesanal",
  "id": "35579",
  "active": null,
  "redirects": [
    "Topic:Pi71hsicrr6udelr",
    "Topic:Pi71hsicrr6udelr#flow-post-pi71hsu7xm3hy87w",
    "Topic:Pi71hsicrr6udelr",
    "Topic:Pi71ht8j03w83jvb",
    "Topic:Pi71ht8j03w83jvb#flow-post-pi71ht5z0zkxemea"
  ]
}
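To make the shape of this concrete, here's a standalone sketch running the sample entry above through the same helpers (the topics set is hypothetical: pretend only the second topic ID appears in topics-history.txt):

```python
import re

# The same helpers as in the script above, reproduced for a standalone demo
def normaliseTopicTitle( topic ):
    match = re.match( r'Topic:([a-z0-9]*)', topic, re.I )
    return match[ 1 ].lower()

def extractTopic( entry ):
    matches = filter( None, [ re.match( r'(Topic:[a-z0-9#]*)', redirect, re.I )
                              for redirect in entry[ 'redirects' ] ] )
    return [ match[ 1 ] for match in matches ]

# Sample entry from above
entry = {
    'title': 'Tópico:Wikilivros:Diálogos comunitários/Importar w:Encadernação artesanal',
    'id': '35579',
    'active': None,
    'redirects': [
        'Topic:Pi71hsicrr6udelr',
        'Topic:Pi71hsicrr6udelr#flow-post-pi71hsu7xm3hy87w',
        'Topic:Pi71hsicrr6udelr',
        'Topic:Pi71ht8j03w83jvb',
        'Topic:Pi71ht8j03w83jvb#flow-post-pi71ht5z0zkxemea',
    ],
}
topics = { 'pi71ht8j03w83jvb' }  # hypothetical: only this ID is a real Flow topic

valid = [ t for t in extractTopic( entry ) if normaliseTopicTitle( t ) in topics ]
print( valid[ 0 ] )  # Topic:Pi71ht8j03w83jvb
```

Note that the second surviving entry comes out as `Topic:Pi71ht8j03w83jvb#flow`: the character class stops at the first hyphen in the anchor.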

Meanwhile topics-history.txt is the output of the topics-extraction script above, run on the full history of Flow topics. It turned out to be identical to the other one; I'll touch on that in a moment.

t3wm81bsfcch9gyv
plsnapky2n3gt7dq
plu9akz75d7kgjoj
pm3jmvoiwfalzdk0
pn326towg59zc9fk
pnr8l6kbz9rfegri
poid59ocehbqkud2
q3q7vulwhl5ejv9p
rqyf54kppohyz2es
tzist8lqne79mkod
...

Anyway, that results in the following list of redirects that could be fixed up:

That's 393 entries.

Here's your test case, looking correct:

{
  "title": "Tópico:Utilizador Discussão:Helder.wiki/Re: Tópicos sobre a página \"Wikipédia:Dicionário/pt-AO\"",
  "to": "Topic:Pv0t4gx9sfpb67yc"
}

There are two pages, Tópico:Utilizador Discussão:Helder.wiki/ArticleFeedback and Tópico:Utilizador Discussão:Helder.wiki/Tabelas wiki (2), that have active redirects to other things in the Tópico: namespace and which also have associated Flow topics.

Moving on to the matter of Topics which have been moderated: I found that the dumps each had the same number of boards, topics and posts – only differing in the number of revisions. I found that the hidden state was attached to these revisions with modstate="hide" in the attributes. In any case: the output list of topics in both the extracted and full-history versions of the Flow database dump were identical. I spent some time poking at it, and I'm relatively satisfied that hidden state isn't going to bite us here. What do you think?
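As a sketch of that check, assuming a simplified element layout (the fragment below is made up; the real dumps differ in structure and are far too large to load wholesale):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the shape described above: moderation state
# is carried as a modstate attribute on individual revisions.
fragment = '''<board>
  <topic id="plsnapky2n3gt7dq">
    <revision id="1"/>
    <revision id="2" modstate="hide"/>
  </topic>
  <topic id="pm3jmvoiwfalzdk0">
    <revision id="3"/>
  </topic>
</board>'''

root = ET.fromstring( fragment )
hidden = { topic.get( 'id' )
           for topic in root.iter( 'topic' )
           if any( rev.get( 'modstate' ) == 'hide' for rev in topic.iter( 'revision' ) ) }
print( sorted( hidden ) )  # ['plsnapky2n3gt7dq']
```

For the real multi-gigabyte dumps you'd feed this through `ET.iterparse` with `elem.clear()` rather than `fromstring`.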

From here I think my available next steps are, in no particular order:

  • Write a script to set up these redirects
  • For each thread in namespace 90, identify its parent page and generate those lists of pages with LQT disabled, missing parents, etc

Sound reasonable?

Mostly unrelated to the above but I found that working with ElementTree was a lot more pleasant when I could have a look at the object in Jupyter: https://phabricator.wikimedia.org/P77955

In any case: the output list of topics in both the extracted and full-history versions of the Flow database dump were identical

Interesting, maybe I misremembered the problem I had faced a few months ago. Anyway, better safe than sorry.

A few of these cases are generating an invalid anchor #flow:

{
  "title": "Tópico:Wikilivros Discussão:Página principal/draft/Imagens fora de escala/resposta (2)",
  "to": "Topic:Plu9akz75d7kgjoj#flow"
},

This hardly matters, since the LQT thread namespace is eventually going away and I doubt I'll end up running the Phase 2 Flow cleanup bot script which relies on these anchors. But in the name of technical correctness, and since the fix would be trivial (add -, which needs to be escaped, to the character class in extractTopic), you could keep the anchor.
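Concretely, the character-class fix would look something like this (using a redirect from the sample entry earlier in the thread):

```python
import re

redirect = 'Topic:Pi71hsicrr6udelr#flow-post-pi71hsu7xm3hy87w'

# Original character class truncates the anchor at the first '-':
old = re.match( r'(Topic:[a-z0-9#]*)', redirect, re.I )[ 1 ]

# With an escaped hyphen added to the class, the full anchor survives:
new = re.match( r'(Topic:[a-z0-9#\-]*)', redirect, re.I )[ 1 ]

print( old )  # Topic:Pi71hsicrr6udelr#flow
print( new )  # Topic:Pi71hsicrr6udelr#flow-post-pi71hsu7xm3hy87w
```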

There are two pages, Tópico:Utilizador Discussão:Helder.wiki/ArticleFeedback and Tópico:Utilizador Discussão:Helder.wiki/Tabelas wiki (2), that have active redirects to other things in the Tópico: namespace and which also have associated Flow topics.

(Also hardly matters, but in the name of technical correctness ...) you should prefer the Flow topics there - the LQT redirects were imported to Flow as {{LQT moved thread stub converted to Flow}}.

And yep, your plan sounds reasonable.

Good catch – I'd initially eaten the whole anchor and then thought better of it.

Change #1156959 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/extensions/Flow@master] Do not try to import pages that already redirect to Flow

https://gerrit.wikimedia.org/r/1156959

Pppery changed the subtype of this task from "Task" to "Bug Report". Jun 13 2025, 9:26 PM
Pppery updated Other Assignee, added: Pppery.
Pppery changed the subtype of this task from "Bug Report" to "Task".

Change #1157155 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/extensions/Flow@master] Ignore revisions by Flow talk page manager when importing LQT

https://gerrit.wikimedia.org/r/1157155

So, thinking about next steps (to address on Monday - I know it's the weekend but I tend to code at strange hours):

  • Set up the redirects in proposed-redirects.json
  • Find pages with lost or corrupt topics
  • Automatically remove {{#useliquidthreads:0}} from pages with lost topics
  • Manually handle any edge cases found in the above processes
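The {{#useliquidthreads:0}} removal step could be sketched as below (a minimal sketch; the wikitext here is invented, and real pages may need more careful handling of surrounding whitespace and case variants):

```python
import re

# Hypothetical page content containing the magic word on its own line
wikitext = 'Some intro text.\n{{#useliquidthreads:0}}\nMore text.'

# Remove the magic word, tolerating whitespace and case variation, and
# swallow the newline it leaves behind so no blank line remains
cleaned = re.sub( r'\{\{\s*#useliquidthreads\s*:\s*0\s*\}\}\n?', '', wikitext, flags=re.I )
print( cleaned )
```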

Everything up to this point is uncontroversial technical cleanup that would have been a good idea even if Flow and LQT weren't being undeployed, and needs no community consultations.

  • Do what's actually described in the task title ("Inform ptwikibooks of LQT and Flow removal plan and timing")
  • Review and merge the patches above, which make LQT->Flow importing handle the slightly-less-odd-than-before situation you've arrived at, and either backport them or wait for the train.
  • Test on some individual pages with convertLqtPageOnLocalWiki.php on ptwikibooks and make sure it works as I expected it to (I've spent several hours attempting to set up similarly odd scenarios on my local MediaWiki instance and they seemed to work, but this whole thing is so strange that I would actually be surprised if there *weren't* more gremlins revealed).
  • Once satisfied, run convertAllLqtPages.php on ptwikibooks. Counter to my comments far above, the standard way of running it should work, and you no longer need --force-recovery-conversion (the script is smart enough to handle most cases of the LQT page already being at '/LQT Archive 1', and the few it can't handle are caught below)
  • Probably clean up a few more edge cases resulting from the conversion - I have a sinking feeling that the pages moved in the 2016 clean up attempt (https://pt.wikibooks.org/wiki/Especial:Registo/move/Flow_talk_page_manager?limit=17) may get handled in an especially bad way. Probably a few Flow boards will need to be moved around or deleted, which will be a pain because only WMF staff with Global Flow Creator rights can do it, but can be managed.
  • At this point all pages in the LQT "Thread:" namespace except a small subset previously identified as corrupt in the initial scan should be redirects. Review any that aren't and handle them as needed. (Secretly, there will be some, because at least one LQT page with several threads was deleted by local admins and the threads weren't. There may be others)
  • Once that's confirmed, freeze LQT on ptwikibooks.

Everything up to this point covers things that the community has asked to be done ever since Flow was created, but which were never done properly, and which would have been the end state even if Flow weren't being undeployed. Then come the standard steps that apply to any wiki with Flow:

  • Run FlowMoveBoardsToSubpages on ptwikibooks.
  • Make sure all of the Flow boards were properly moved and move those that weren't.
  • Mark Flow read-only on ptwikibooks.
  • Much later, consider various techniques for converting Flow to wikitext. If I end up running the bot I coded at https://gitlab.wikimedia.org/pppery/flow-export-with-history, it should still be able to find the LQT pages to extract signatures and such, although I will need to make minor adjustments so that bot noise doesn't percolate into the exported content. Other scripts that don't try to preserve history, like Flow's built-in convertToText, probably won't care.

I messed up: I assumed revisions were most-recent-first in the input dump, but they're least-recent-first. I'm now threading revision information through that code so that proposed-redirects.json will note the existing revision it's supposed to replace. I'll also check for outliers in regards to edit dates and edit authors on the revisions whose redirects I'm restoring.

I can't access that file. See https://mediawiki.org/wiki/Phabricator/Help#File_visibility

For dates and authors, the revisions restored should all be by Flow talk page manager and from some time in late 2015.

Sorry about that, fixed.

Now that I can attach more data to the proposal file, I can start checking data in other ways. For example, I found exactly one page with a fixable redirect whose most recent revision was not caused by Flow talk page manager:

'\n'.join([ page[ 'title' ] for page in redirects if buildProposedRedirect( page ) and page[ 'revisions' ][ -1 ][ 'username' ] != 'Flow talk page manager' ])

Tópico:Wikilivros Discussão:Portal comunitário/Proposta para converter as páginas de LiquidThreads para Flow

Do we want that in or out of the redirect fix? It turns out that's an inadvertent edit plus a rollback to the Talk page manager edit, so that's okay.

I've also confirmed that every proposed restored redirect was one created by the talk page manager, and that each one was created between the 2nd and 25th of November 2015.

Anyway, hopefully this code is a little more readable. It results in 393 redirects, which is the same as the last run which chose old redirects.

import json
import bz2
import re

outputfile = 'proposed-redirects.json'

with bz2.open( 'redirects.json.bz2' ) as file:
    redirects = json.load( file )

with open( 'topics-history.txt' ) as file:
    topics = set( file.read().splitlines() )

# Strip the 'Topic:' namespace and '#anchor' suffixes to extract the database ID of a topic
def normaliseTopicTitle( topic ):
    match = re.match( r'Topic:([a-z0-9]*)', topic, re.I )
    return match[ 1 ].lower()

def revisionHasValidRedirect( revision ):
    redirect = revision[ 'redirect' ]
    if redirect is None:
        return False
    if re.match( r'(Topic:[a-z0-9#-]*)', redirect, re.I ) is None:
        return False
    return normaliseTopicTitle( redirect ) in topics

def extractRevisionToRestore( page ):
    options = [ revision for revision in page[ 'revisions' ] if revisionHasValidRedirect( revision ) ]
    if len( options ) == 0:
        return None
    if page[ 'active' ] is not None:
        #print( f'Page has active redirect: { page[ "title" ] } -> { page[ "active" ] }' )
        return None
    return options[ -1 ]

def buildProposedRedirect( page ):
    toRestore = extractRevisionToRestore( page )
    if toRestore is None:
        return None
    replacing = page[ 'revisions' ][ -1 ]
    return {
        'title': page[ 'title' ],
        'id': page[ 'id' ],
        'to': toRestore[ 'redirect' ],
        'expected': replacing[ 'id' ],
    }

with open( outputfile, 'w' ) as file:
    entries = list( filter( None, ( buildProposedRedirect( page ) for page in redirects ) ) )
    json.dump( entries, file, ensure_ascii=False, indent=2 )
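As a standalone check of that selection logic, here's a made-up page run through the same revision filter (the title and revision IDs below are invented for illustration; the topic IDs come from the test case discussed earlier):

```python
import re

topics = { 'pv0t4gx9sfpb67yc' }  # pretend only this topic is valid

def normaliseTopicTitle( topic ):
    return re.match( r'Topic:([a-z0-9]*)', topic, re.I )[ 1 ].lower()

def revisionHasValidRedirect( revision ):
    redirect = revision[ 'redirect' ]
    if redirect is None:
        return False
    if re.match( r'(Topic:[a-z0-9#-]*)', redirect, re.I ) is None:
        return False
    return normaliseTopicTitle( redirect ) in topics

page = {
    'title': 'Tópico:Example',  # hypothetical
    'id': '40534',
    'active': None,
    'revisions': [
        { 'id': '100', 'redirect': 'Topic:Pv0t4h52glek57o8' },  # deleted topic
        { 'id': '200', 'redirect': 'Topic:Pv0t4gx9sfpb67yc' },  # valid topic
        { 'id': '300', 'redirect': None },                      # redirect later removed
    ],
}

# Revisions are least-recent-first, so options[-1] is the newest valid redirect
# and page['revisions'][-1] is the current revision we'd expect to replace
options = [ r for r in page[ 'revisions' ] if revisionHasValidRedirect( r ) ]
print( options[ -1 ][ 'redirect' ], page[ 'revisions' ][ -1 ][ 'id' ] )
```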

Our canary from the last time I did this still looks sensible:

{
  "title": "Tópico:Utilizador Discussão:Helder.wiki/Re: Tópicos sobre a página \"Wikipédia:Dicionário/pt-AO\"",
  "id": "40534",
  "to": "Topic:Pv0t4gx9sfpb67yc",
  "expected": "425611"
},

I'm not clear on which account I should use to make these changes. Having looked at bot policy, I've definitely got to use a bot account for this. However, would Flow talk page manager be the appropriate account to use? Is that something I can do while writing a script that calls the API, or is this the moment I have to go and learn how to do this in PHP?

Edit: looks like there's internal processes for this, doing those...

I'm most of the way to a list of root pages and proposed edits to remove {{#useliquidthreads:0}}; hopefully I'll have some code and output ready to show soon, along with a list of broken threads. I'll also confirm there's nothing in namespace 90, both to exclude overlap with the redirect fixups and as a sanity check.

Also: I've requested both flood rights and Global Flow Creator rights, so from here it should be a matter of waiting for that and building a script to execute on those.

I suppose in both cases I'm making proposed edits, and I'm going to use the existing revision ID as a check to make sure that I don't make a change twice. With a bit of luck I can then use the same bulk-edit script for both tasks.
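That double-run guard could, in principle, be as simple as comparing revision IDs before each edit. A sketch (the helper name and its parameters are my own invention, not the actual script's):

```python
def shouldApplyEdit( proposed, currentRevisionId ):
    """Only act when the page is still in the state the proposal was built from.

    'proposed' is an entry from proposed-redirects.json. If the page's current
    top revision no longer matches 'expected', the fix was already applied (or
    someone edited in between), so we skip rather than edit twice.
    """
    return currentRevisionId == proposed[ 'expected' ]

entry = { 'title': 'Tópico:Example', 'to': 'Topic:Pv0t4gx9sfpb67yc', 'expected': '425611' }
print( shouldApplyEdit( entry, '425611' ), shouldApplyEdit( entry, '425999' ) )  # True False
```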

Change #1156959 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Do not try to import pages that already redirect to Flow

https://gerrit.wikimedia.org/r/1156959

Change #1157155 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Ignore revisions by Flow talk page manager when importing LQT

https://gerrit.wikimedia.org/r/1157155

I've received a gentle suggestion to create some subtasks instead of doing everything in this ticket, so...

(tracking moved to ticket description)

I'm most of the way there. At present I'm refactoring my codebase into bare Python scripts instead of Jupyter notebooks and making sure there's a readme – I'm sure nobody else will ever actually need to run them for real, but it's important to have a record of how I created these files, to have a spot of code review, and because it's probably helpful to have examples of how to go spelunking through multi-gigabyte XML files without running out of memory.

because it's probably helpful to have examples of how to go spelunking through multi-gigabyte XML files without running out of memory

I've had to do that before myself when working on https://en.wikipedia.org/wiki/User:Pppery/Bad_blobs. It was a pain.
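For reference, the memory-flat spelunking pattern looks roughly like this (the inline sample XML is a stand-in for a real dump file; real dumps have namespaced tags and many more fields):

```python
import io
import xml.etree.ElementTree as ET

# Process each <page> as it completes, then clear it so memory stays flat
# regardless of file size. In real use, pass an open file (or bz2 stream).
sample = io.BytesIO( b'''<mediawiki>
  <page><title>Thread:A</title></page>
  <page><title>Thread:B</title></page>
</mediawiki>''' )

titles = []
for event, elem in ET.iterparse( sample, events=( 'end', ) ):
    if elem.tag == 'page':
        titles.append( elem.findtext( 'title' ) )
        elem.clear()  # drop children so memory doesn't grow with the file
print( titles )  # ['Thread:A', 'Thread:B']
```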

@ppelberg – The parent task of this is the general LQT deprecation thread. Would it be helpful to rename this to something like "ptwikibooks LQT deprecation"?

@Pppery – Is the manual move to archival subpages necessary? I was under the impression that old LQT pages will be moved by the LQT script, and then likewise for the Flow pages. If it is, does it block running the script?

Is the manual move to archival subpages necessary?

I don't know what you are referring to.

I'd like to hand this ticket to @Trizek-WMF to coordinate with the community for the steps after we've fixed things up ready to run the conversion script, per your suggestion in https://phabricator.wikimedia.org/T388099#10914958, and so I'm also trying to make sure that the current task description makes sense.

Right now it starts:

The removal plan for both extensions will happen as follows:

  1. LQT pages are moved to sub-pages for archiving
  2. All Flow boards are moved to sub-pages
  3. LQT pages are converted to Flow

I'm referring to item 1 here. The parent task, T385290, implies that this is a manual operation done by volunteers, but that doesn't seem practical for ptwikibooks. Meanwhile, my understanding is that the LQT page gets renamed by convertAllLqtPages.php automatically, not that it'll matter once LQT is undeployed... and that then when we do the Flow conversion the Flow boards will _also_ get moved, leaving talk pages as normal discussion tools pages with links to Flow pages. I'd like to make sure I'm right on that, because I'm confused why this step appears in the task description.

No, there is no need to manually move LQT pages before anything is done there.

It's also worth pointing out that the order we decided to do things in is different from the order in the task description; I thought we had agreed on:

  1. Cleanup the failed migrations
  2. Migrate LQT to Flow
  3. Review the migration to make sure it makes sense
  4. Move Flow boards to subpages

Whereas right now we have "Move Flow boards to subpages" first in the task description.

Mm, we basically had a set of steps laid out twice, first incorrectly and then correctly. I've just deleted the first half of the task description.

@Pppery am I right in thinking that we don't need to update https://pt.wikibooks.org/wiki/Predefini%C3%A7%C3%A3o:Boas-vindas after running the script, or does setting $wmgLiquidThreadsFrozen to false not suffice?

In theory just freezing LQT should work, assuming you only run the script once. It's probably a good idea anyway to remove that code once the bulk fixes are done, since that was one of the causes of the original mess.

Just in case it's needed, here's a copy (from today) of all the LQT threads that the API returns:

Phew! I think I may have broken this down into too many tickets.

I've now reparented this ticket under T402545: ptwikibooks LiquidThreads sunsetting, which also tracks the execution tickets.

@Pppery – I'm sorry, I must have spammed you so much today. Could you let me know if this looks sensible and in accordance with your expectations?

I get dozens of Phabricator notifications every day (because I watch so many projects), so the spam was only slightly more than usual. Otherwise you seem to be on the right track.

What is the status of this? There's been a lot of internal reorganizing but it seems no actual progress for a while.

@Pppery I've written a community notice; we're reviewing it internally and are hoping to have it translated and up on ptwikibooks next week. I'm currently proposing that we give a week's notice and then execute everything in one go, unless the community wants us to make changes to that schedule.

Change #1206828 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Flow@master] Fix "undefined array key" error in PageRevisionedObject

https://gerrit.wikimedia.org/r/1206828

Change #1206828 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Fix "undefined array key" error in PageRevisionedObject

https://gerrit.wikimedia.org/r/1206828

Change #1207900 had a related patch set uploaded (by Reedy; author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Flow@REL1_45] Fix "undefined array key" error in PageRevisionedObject

https://gerrit.wikimedia.org/r/1207900

Change #1207900 merged by jenkins-bot:

[mediawiki/extensions/Flow@REL1_45] Fix "undefined array key" error in PageRevisionedObject

https://gerrit.wikimedia.org/r/1207900